This dissertation investigates the use of hierarchy and abstraction as a means of solving
complex sequential decision making problems such as those with continuous state and/or
continuous action spaces, and domains with multiple cooperative agents. This thesis
develops several novel extensions to hierarchical reinforcement learning (HRL), and designs
algorithms that are appropriate for such problems.

It  has  been  shown  that  the  average  reward  optimality  criterion  is  more  natural  than  the 
more  commonly  used  discounted  criterion  for  continuing  tasks.  This  thesis  investigates  two 
formulations  of  HRL  based  on  the  average  reward  semi-Markov  decision  process  (SMDP) 
model,  both  for  discrete-time  and  continuous-time.  These  formulations  correspond  to  two 
notions of optimality that have been explored in previous work on HRL: hierarchical
optimality and recursive optimality. Novel discrete-time and continuous-time algorithms,
termed hierarchically optimal average reward RL (HAR) and recursively optimal average
reward RL (RAR), are presented, which learn to find hierarchically and recursively
optimal average reward policies. Two automated guided vehicle (AGV) scheduling problems
are used as experimental testbeds to empirically study the performance of the proposed
algorithms.

Policy  gradient  reinforcement  learning  (PGRL)  methods  have  several  advantages  over 
the  more  traditional  value  function  RL  algorithms  in  solving  problems  with  continuous 
state  spaces.  However,  they  suffer  from  slow  convergence.  This  thesis  defines  a  family 
of  hierarchical  policy  gradient  RL  (HPGRL)  algorithms  for  scaling  PGRL  methods  to 
high-dimensional  domains.  In  HPGRL,  each  subtask  is  defined  as  a  PGRL  problem  whose 
solution  involves  computing  a  locally  optimal  policy.  Subtasks  are  formulated  in  terms  of 
a parameterized family of policies, a performance function, a method to estimate the
gradient of the performance function, and a routine to update the policy parameters using this
gradient. The usually slow convergence of HPGRL algorithms is improved by formulating
high-level subtasks, which usually require low-resolution discretization of the state space
and have finite action spaces, as value function RL problems, and lower-level subtasks,
which usually require high-resolution discretization of the state space and may have
infinite action spaces, as PGRL problems. This family of algorithms is termed hierarchical
hybrid  algorithms.  The  effectiveness  of  the  proposed  algorithms  is  demonstrated  using  a 
taxi-fuel  problem  as  well  as  a  more  complex  continuous  state  and  action  ship  steering  task. 

This thesis also examines the use of HRL to accelerate policy learning in cooperative
multi-agent tasks. The use of hierarchy speeds up learning in multi-agent domains
by making it possible to learn coordination skills at the level of subtasks instead of
primitive actions. Subtask-level coordination allows for increased cooperation skills as agents
do not get confused by low-level details. A framework for hierarchical multi-agent RL
is developed and an algorithm called Cooperative HRL is presented that solves cooperative
multi-agent problems more efficiently. This algorithm is empirically evaluated using a
large four-agent AGV scheduling task. The framework and algorithm are extended to include
communication decisions. The goal is for agents to learn both action and communication
policies that together optimize the task given the communication cost. The extended
algorithm, called COM-Cooperative HRL, is a hierarchical multi-agent RL algorithm with
communication decisions. The efficacy of this algorithm as well as the relation between
communication cost and the learned communication policy is demonstrated using a
multi-agent taxi problem.

Together, the methods and algorithms developed in this dissertation use prior knowledge
in a principled way, and extend HRL to solving complex sequential decision making
problems  such  as  those  with  continuous  state  and/or  continuous  action  spaces  and  domains 
with  multiple  cooperative  agents. 
CHAPTER 1


INTRODUCTION 


Sequential  decision  making  under  uncertainty  is  one  of  the  fundamental  problems  in 
Artificial  Intelligence  (AI).  Many  sequential  decision  making  problems  can  be  modeled 
using  the  Markov  decision  process  (MDP)  formalism.  An  MDP  (Howard,  1960;  Puterman, 
1994)  models  a  system  that  we  are  interested  in  controlling  as  being  in  some  state  at  each 
time  step.  As  a  result  of  actions  the  agent  selects,  the  system  moves  through  some  sequence 
of  states  and  receives  a  sequence  of  rewards.  The  goal  is  to  select  actions  to  maximize  some 
measure  of  long-term  reward. 

Reinforcement learning (RL) is a machine learning framework for solving problems
posed in the MDP formalism. Despite numerous successes in a range of
domains, including backgammon (Tesauro, 1994), job-shop scheduling (Zhang and
Dietterich, 1995), dynamic channel allocation (Singh and Bertsekas, 1996), elevator scheduling
(Crites and Barto, 1998), and helicopter flight control (Ng et al., 2004), current RL
methods do not scale well to high dimensional domains: they can be slow to converge and
require too many training samples to be practical for many real-world problems. This issue
is known as the curse of dimensionality: the exponential growth of the number of
parameters to be learned with the size of any compact encoding of system state (Bellman, 1957).
Recent attempts to combat the curse of dimensionality have turned to principled ways of
exploiting abstraction in RL. This leads naturally to hierarchical control architectures and
associated learning algorithms.
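
To make the exponential growth concrete, the following sketch counts the entries of a tabular
action-value function when the state is encoded by several discrete variables. The function
name tabular_q_size and the numbers used (10 values per variable, 5 actions) are illustrative
assumptions, not figures taken from this dissertation.

```python
def tabular_q_size(values_per_variable: int, num_variables: int, num_actions: int) -> int:
    """Number of entries in a tabular Q-function.

    Assumes (for illustration only) that every state variable takes the same
    number of discrete values and that every action is available in every state.
    """
    num_states = values_per_variable ** num_variables  # grows exponentially in num_variables
    return num_states * num_actions


if __name__ == "__main__":
    # Hypothetical setting: 10 values per state variable, 5 actions.
    for n in (2, 4, 8, 16):
        print(f"{n:2d} state variables -> {tabular_q_size(10, n, 5):,} table entries")
```

Each added state variable multiplies the table size by the number of values that variable can
take, which is why a flat tabular learner quickly becomes impractical.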

Although hierarchical reinforcement learning (HRL) approaches exploit the power of
abstraction and scale better than flat RL methods to high dimensional domains, they still
suffer from the main limitation of flat RL algorithms: the curse of dimensionality.
Moreover, HRL methods have so far only been studied in a narrow context: they have been
investigated for the discrete-time discounted reward SMDP model; they have all been value
function RL methods; and they have only been studied in single-agent domains.

This dissertation expands the context and scope of HRL. The objective here is to develop
several novel extensions to existing HRL frameworks and design algorithms that are
appropriate  for  solving  complex  sequential  decision  making  problems  such  as  those  with 
continuous  state  and/or  continuous  action  spaces,  and  domains  with  multiple  cooperative 
agents. 

1.1  Motivation 

Many problems faced by animals and AI systems can be modeled as sequential decision
making in uncertain dynamic environments. For example, a complex manufacturing system,
such as one for manufacturing automobiles or personal computers, involves optimizing
hundreds or even thousands of processes (sub-systems) such as inventory, engineering
design, assembly, and marketing.

These  problems  involve  decision  makers,  or  agents,  selecting  sequences  of  actions  in 
order  to  achieve  multiple  long-term  goals.  Moreover,  uncertainty  is  ever  present  in  these 
domains, both in the effects of actions, and in the evolution of the actual system. The uncertain
and ever changing nature of these problems makes it difficult to plan ahead of time.
Hence,  these  tasks  require  control  rules,  which  are  dependent  on  the  state  variables  of  the 
system.  In  recent  years,  advances  in  technology  have  led  to  increased  interest  in  automated 
methods  for  solving  these  tasks.  Commercial  tools  are  now  available  for  problems  ranging 
from  factory  optimization  to  medical  diagnosis.  Unfortunately,  these  problems  tend  to  be 
very  complex,  and  most  of  the  existing  automated  techniques  either  build  on  heuristics, 
or  do  not  fully  address  the  long-term  or  the  uncertain  aspects  of  these  sequential  decision 
making  tasks. 
Fortunately, although such problems are very complex, they are often hierarchically
decomposable into a set of simpler subtasks. As argued by Simon (1981) in “Architecture
of Complexity,” many complex systems have a decomposable hierarchical structure, with
the subsystems interacting only weakly between themselves. Humans exploit this decomposable
hierarchical structure in solving such complex and large-scale problems.

An example will help illustrate the basic concepts. This example has been chosen because
it involves an interesting and challenging manufacturing system, and because
several versions of it have been used in the experiments of this dissertation.
Figure 1.1 shows an automated guided vehicle (AGV) scheduling task. AGVs are used
in flexible manufacturing systems (FMSs) for material handling (Askin and Standridge,
1993). They are typically used to pick up parts from one location and drop them off at
another location for further processing. Locations correspond to workstations (M1 to M4) or
storage locations (load and unload stations). Loads that are released at the drop-off points
(D1 to D4) of workstations wait at their pick-up points (P1 to P4) after the processing
is over, so the AGV is able to take them to the warehouse or some other location. The
pick-up points (P1 to P4) are the output buffers of the machines or workstations. Any FMS
using AGVs faces the problem of optimally scheduling the paths of the AGVs in the system
(Klein and Kim, 1996). For example, a move request occurs when a part finishes at a
workstation. If more than one vehicle is empty, the vehicle that will service this request
needs to be selected. Also, when a vehicle becomes available and multiple move requests
are queued, a decision needs to be made as to which request should be serviced by that
vehicle. These schedules obey a set of constraints that reflect the temporal relationships
between activities and the capacity limitations of a set of shared resources. The system
performance is generally measured in terms of throughput, on-line inventory, and AGV travel
time, but throughput is by far the most important factor. Throughput is measured in terms
of the number of finished assemblies deposited at the unloading deck per unit time. Since
this problem is very complex, various heuristics and their combinations are generally used
to schedule AGVs (Klein and Kim, 1996). However, the heuristics perform poorly when
the constraints on the movement of the AGVs are reduced.
Figure 1.1. An AGV scheduling domain with four machines M1 to M4. AGVs are responsible
for carrying raw materials and finished parts between the machines and the warehouse.


In order for an AGV to optimize this task, it must learn all of its subtasks, such as carrying
parts from the load station to machines, delivering assemblies from machines to the unload
station at the warehouse, and navigating to the load and unload stations; it must also learn
the order in which to execute these subtasks. The state space of this task consists of the
AGV's status and location, the status of the input and output buffers of workstations, and
the availability of parts in the warehouse, and it can become enormous. This makes it very
difficult for flat (non-hierarchical) RL methods to be used in this problem, as we will show
in Chapters 4 and 6.

However, the AGV scheduling task described above decomposes naturally into a set
of non-primitive subtasks, such as deliver material to workstations (DM1 to DM4), deliver
assembly from workstations to warehouse (DA1 to DA4), navigate to the load station
at the warehouse (NavLoad), navigate to the drop-off points of workstations (NavPut1 to
NavPut4), navigate to the pick-up points of workstations (NavPick1 to NavPick4), and navigate
to the unload station at the warehouse (NavUnload), and a set of primitive subtasks such
as load, put, pick, unload, left, forward, and right. These are the subtasks that are naturally
important in solving the AGV scheduling task. The designer of the system uses his or her
domain knowledge to put the primitive and non-primitive subtasks of the AGV scheduling
problem together and build a hierarchical task decomposition like the one shown in Figure
1.2. This hierarchical decomposition can later be used by HRL algorithms such as hierarchies
of abstract machines (HAMs) (Parr, 1998), options (Sutton et al., 1999; Precup, 2000),
MAXQ (Dietterich, 2000), and programmable HAMs (PHAMs) (Andre and Russell, 2001;
Andre, 2003) to optimize the AGV scheduling problem. Using hierarchical RL methods
leads to faster convergence and better performance than the flat algorithms, as we will show
for MAXQ in this thesis.
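
As a rough illustration of such a task decomposition, the sketch below encodes the subtasks
named in the text (DM1 to DM4, DA1 to DA4, NavLoad, NavPut1 to NavPut4, NavPick1 to NavPick4,
NavUnload, and the primitive actions) as a mapping from each non-primitive subtask to its
children. The exact parent-child links, the name Root, and the helper functions are
assumptions made for this example; the authoritative structure is the one in Figure 1.2.

```python
from typing import Dict, List, Set

# Primitive navigation actions named in the text.
NAV_PRIMITIVES = ["left", "forward", "right"]


def build_agv_hierarchy(num_machines: int = 4) -> Dict[str, List[str]]:
    """Map each non-primitive subtask to its child subtasks.

    The parent-child links below are an assumed reading of the decomposition,
    not a transcription of Figure 1.2.
    """
    hierarchy: Dict[str, List[str]] = {
        "Root": [],
        "NavLoad": list(NAV_PRIMITIVES),
        "NavUnload": list(NAV_PRIMITIVES),
    }
    for i in range(1, num_machines + 1):
        hierarchy["Root"] += [f"DM{i}", f"DA{i}"]
        # Deliver material: fetch a part at the load station, drop it at machine i.
        hierarchy[f"DM{i}"] = ["NavLoad", "load", f"NavPut{i}", "put"]
        # Deliver assembly: collect it at machine i, bring it to the unload station.
        hierarchy[f"DA{i}"] = [f"NavPick{i}", "pick", "NavUnload", "unload"]
        hierarchy[f"NavPut{i}"] = list(NAV_PRIMITIVES)
        hierarchy[f"NavPick{i}"] = list(NAV_PRIMITIVES)
    return hierarchy


def is_primitive(task: str, hierarchy: Dict[str, List[str]]) -> bool:
    """A task is primitive if it has no entry (no children) in the hierarchy."""
    return task not in hierarchy


def reachable_primitives(task: str, hierarchy: Dict[str, List[str]]) -> Set[str]:
    """Collect every primitive action that can be executed beneath a subtask."""
    if is_primitive(task, hierarchy):
        return {task}
    result: Set[str] = set()
    for child in hierarchy[task]:
        result |= reachable_primitives(child, hierarchy)
    return result


if __name__ == "__main__":
    h = build_agv_hierarchy()
    print(h["Root"])
    print(sorted(reachable_primitives("DM1", h)))
```

Under these assumptions the root chooses among eight temporally extended subtasks, while each
navigation subtask only ever selects among the three movement primitives.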


Figure  1.2.  A  hierarchical  task  decomposition  for  an  AGV  scheduling  problem. 


These HRL algorithms find the hierarchically or recursively optimal discounted reward
policy for the AGV scheduling problem when the number of states is finite. However, as
we mentioned earlier, even HRL algorithms suffer from the curse of dimensionality: they
take a long time and require too many samples to converge as the state space of
the system grows. This raises several important questions: 1) Is discounted reward
optimality the most suitable optimality criterion for this task? If it is not, is it possible
to design HRL algorithms to find a more appropriate optimal policy for this problem? 2)
Consider the continuous state and action version of the AGV scheduling problem, in which the
AGV must learn to navigate using low-level continuous commands instead of directional
actions such as forward or left, and it has continuous sensors instead of only viewing the
world as a discrete grid. Are the existing HRL algorithms still able to solve the problem
efficiently? 3) Consider the multi-agent version of the AGV scheduling problem, where
there are several AGVs in the environment cooperating with each other to carry parts to
workstations and bring assemblies from workstations back to the warehouse. The number
of states and actions, and as a result the number of parameters to be learned, increases
dramatically with the number of agents (AGVs). Does the nature of cooperative
multi-agent problems allow us to design more efficient HRL algorithms for these domains? These
are the types of questions that we try to address in this dissertation. We briefly describe
how we address them in the next section, and leave the more detailed
discussion for later chapters.

1.2  Our  Approach 

Prior  work  in  HRL  including  HAMs,  options,  MAXQ,  and  PHAMs  has  been  limited  to 
the  discrete-time  discounted  reward  SMDP  model.  However,  the  average  reward  optimality 
criterion  is  generally  more  appropriate  in  modeling  cyclical  control  and  optimization  tasks, 
such  as  queuing,  scheduling,  and  flexible  manufacturing.  We  investigate  two  formulations 
of HRL based on the average reward SMDP model, for both discrete-time and
continuous-time settings. These formulations correspond to two notions of optimality that have
been previously explored in HRL: hierarchical optimality and recursive optimality (Dietterich,
2000). We present algorithms that learn to find hierarchically and recursively optimal average
reward policies under discrete-time and continuous-time average reward SMDP models. We call
them hierarchically optimal average reward RL (HAR) and recursively optimal average
reward RL (RAR) algorithms.
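
For reference, the two optimality criteria contrasted above can be written in their standard
textbook form (see, e.g., Puterman, 1994). The symbols used here (r_t, \gamma, \tau_n, g^\pi)
are assumed for illustration and may differ from the notation introduced in Chapter 2; the
limits are assumed to exist.

```latex
% Discounted reward criterion, with discount factor 0 <= \gamma < 1:
V^{\pi}(s) = E_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, s_{0} = s \right]

% Average reward (gain) criterion for a discrete-time MDP:
g^{\pi}(s) = \lim_{N \to \infty} \frac{1}{N}\,
             E_{\pi}\!\left[ \sum_{t=0}^{N-1} r_{t} \,\middle|\, s_{0} = s \right]

% Average reward (gain) for an SMDP, where r_{n} is the reward accumulated during
% the n-th decision epoch and \tau_{n} is the duration of that epoch:
g^{\pi}(s) = \lim_{N \to \infty}
             \frac{E_{\pi}\!\left[ \sum_{n=0}^{N-1} r_{n} \,\middle|\, s_{0} = s \right]}
                  {E_{\pi}\!\left[ \sum_{n=0}^{N-1} \tau_{n} \,\middle|\, s_{0} = s \right]}
```

The discounted criterion weights early rewards more heavily, whereas the gain measures reward
per unit time in the long run, which matches throughput-style objectives in cyclical tasks.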

Existing HRL approaches are limited to value function RL (VFRL) methods. However,
there are only weak theoretical guarantees on the performance of VFRL algorithms
on problems with large or continuous state spaces. Policy gradient RL (PGRL) methods
have demonstrated better performance in problems with continuous state and/or continuous
action spaces (Marbach, 1998; Baxter et al., 2001). We propose a family of hierarchical
policy gradient RL (HPGRL) algorithms that exploit both the power of abstraction and
the efficiency of PGRL methods in continuous state and/or continuous action problems.
However, these algorithms inherit the slow convergence of PGRL methods. Consider the
continuous state and action version of the AGV scheduling task again. The low-level subtasks
such as NavUnload are now continuous state and action problems. The AGV needs to know its
exact location and must select its action from an infinite number of possibilities in order to
solve these low-level continuous state and action subtasks. In contrast, when the AGV makes a
decision at a high level of the hierarchy, for instance choosing between delivering material to
or from machines, it needs only a rough estimate of its location. Additionally, the AGV selects
its action from only eight possible choices (DM1 to DM4 and DA1 to DA4). We accelerate
learning in HPGRL algorithms by formulating high-level subtasks, which usually have
smaller state spaces and finite action spaces, as VFRL problems, and low-level subtasks such as
NavUnload, which have infinite state and/or action spaces, as PGRL problems. We call this family
of algorithms hierarchical hybrid algorithms.
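
The hybrid idea can be sketched as a simple rule that assigns a learning formulation to each
subtask. The Subtask class, its two boolean fields, and the function choose_formulation are
hypothetical names introduced for this example; the hard rule below is a simplification of the
tendencies described in the text, where the assignment is a design decision.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    finite_actions: bool    # True if the subtask chooses among finitely many actions
    coarse_state_ok: bool   # True if a low-resolution state discretization suffices


def choose_formulation(subtask: Subtask) -> str:
    """Pick a learning formulation for a subtask (illustrative heuristic only).

    Subtasks with finite actions and coarse state are treated as value function
    RL problems; all others are treated as policy gradient RL problems.
    """
    if subtask.finite_actions and subtask.coarse_state_ok:
        return "value_function"
    return "policy_gradient"


if __name__ == "__main__":
    # High-level choice among eight delivery subtasks vs. continuous navigation.
    root = Subtask("Root", finite_actions=True, coarse_state_ok=True)
    nav_unload = Subtask("NavUnload", finite_actions=False, coarse_state_ok=False)
    for task in (root, nav_unload):
        print(task.name, "->", choose_formulation(task))
```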

Finally, we examine the use of HRL to accelerate policy learning in cooperative
multi-agent tasks. The nature of cooperative multi-agent problems allows for more efficient use
of HRL methods. Consider the multi-agent version of the AGV scheduling task again. In
our approach, AGVs use the same hierarchical task decomposition. Learning is decentralized,
with each agent learning three interrelated skills. First, how to perform subtasks such
as delivering material to machine M1 (DM1) or navigating to the unload station (NavUnload).
Second, the order in which to do the subtasks, for instance going to the load station and
picking up part 1 before heading to workstation M1. Third, how to coordinate with other agents;
for instance, AGV 1 can carry a part for workstation M1 while AGV 2 empties the output buffer of M1.
The use of hierarchy allows AGVs to learn more efficiently by making it possible to learn
coordination skills at the level of subtasks instead of primitive actions. Subtask-level
coordination allows for increased cooperation skills as agents do not get confused by low-level
details. Each AGV learns high-level coordination knowledge (e.g., what is the utility of
AGV 1 carrying a part to machine M1 if AGV 2 is bringing an assembly back from machine
M3), rather than its response to the low-level primitive actions of other AGVs (e.g., if
AGV 1 goes forward, what should AGV 2 do).

In addition to the curse of dimensionality, multi-agent learning suffers from partial
observability. Even if an agent has complete observability of its own state, the states and actions
of other agents are not fully observable. One way to address partial observability in
distributed multi-agent domains is to use communication to exchange required information.
However, communication is usually costly, which requires agents to optimize their
communication policy in addition to their action policy. A further advantage of the use of temporal
abstraction in cooperative multi-agent learning is that AGVs now communicate at the level
of subtasks (temporally extended actions) instead of primitive actions. Since subtasks can
take a long time to complete, communication is needed only fairly infrequently.

In this research, we introduce a hierarchical multi-agent RL framework and present two
algorithms called Cooperative HRL and COM-Cooperative HRL. In the Cooperative HRL
algorithm, we assume communication is free. In the COM-Cooperative HRL algorithm, we
assume communication is costly, and agents learn both action and communication policies
that together optimize the task given the communication cost. As a consequence,
COM-Cooperative HRL learns more slowly than Cooperative HRL, because more parameters must be
learned.
1.3 Contributions


The  main  contributions  of  this  dissertation  are  summarized  below. 

Hierarchical  Reinforcement  Learning 

• We have developed a general hierarchical reinforcement learning (HRL) framework
for simultaneous learning of policies at multiple levels of the hierarchy. This framework
is a generalization of existing HRL approaches, especially the MAXQ value
function decomposition (Dietterich, 2000). In our framework, we apply the three-part
value function decomposition (Andre and Russell, 2002) to guarantee hierarchical
optimality, and use reward shaping (Ng et al., 1999) to reduce the burden of
exploration, thereby extending the MAXQ method.

Hierarchical  Average  Reward  Reinforcement  Learning 

• We extend previous work on hierarchical reinforcement learning (HRL) to the average
reward SMDP model, and investigate hierarchical and recursive optimality in
hierarchical average reward RL.

- We have developed new discrete-time and continuous-time hierarchically optimal
average reward RL (HAR) algorithms. The aim of these algorithms is to
find a hierarchical policy with the highest global gain.

-  We  have  developed  new  discrete-time  and  continuous-time  recursively  optimal 
average  reward  RL  (RAR)  algorithms.  In  these  algorithms,  we  treat  subtasks  as 
continuing average reward problems, where the goal at each subtask is to maximize
its gain given the policies of its children. We investigate the optimality
achieved  by  the  RAR  algorithm  and  illustrate  the  conditions  under  which  the 
policy  learned  by  this  algorithm  at  each  subtask  is  independent  of  the  context 
in  which  it  is  executed  and  therefore  can  be  reused  by  other  hierarchies. 
• We empirically demonstrate the effectiveness and the type of optimality achieved by
HAR  and  RAR  algorithms  using  two  AGV  scheduling  tasks. 

Hierarchical  Policy  Gradient  Reinforcement  Learning 

•  We  have  developed  a  family  of  hierarchical  policy  gradient  RL  (HPGRL)  algorithms 
for scaling policy gradient reinforcement learning methods to problems with continuous
(or large discrete) state and/or action spaces.

• We present a family of hierarchical hybrid algorithms to accelerate learning in HPGRL
algorithms. In hierarchical hybrid algorithms, we formulate high-level subtasks,
which usually require low-resolution discretization of the state space and have
finite action spaces, as value function RL problems, and low-level subtasks, which
usually require high-resolution discretization of the state space and may have infinite
action spaces, as policy gradient RL problems.

•  We  empirically  demonstrate  the  performance  of  hierarchical  hybrid  algorithms  using 
a  continuous  state  and  action  ship  steering  problem. 

Hierarchical Multi-Agent Reinforcement Learning

•  We  extend  the  SMDP  model  to  cooperative  multi-agent  domains  and  present  the 
multi-agent  SMDP  (MSMDP)  model. 

•  We  have  developed  a  hierarchical  cooperative  multi-agent  RL  framework  in  which 
agents  learn  coordination  faster  by  sharing  information  at  the  level  of  subtasks,  rather 
than attempting to learn coordination at the level of primitive actions.

•  We  employ  this  hierarchical  cooperative  multi-agent  RL  framework,  and  present  a 
hierarchical  multi-agent  RL  algorithm  called  Cooperative  HRL. 

• We empirically demonstrate the effectiveness of the Cooperative HRL algorithm using
a large four-agent AGV scheduling problem.
• We extend the Cooperative HRL algorithm to include communication decisions, and
present  a  hierarchical  multi-agent  RL  algorithm  called  COM-Cooperative  HRL.  This 
algorithm  is  designed  to  learn  both  action  and  communication  policies  that  together 
optimize  the  task  given  the  communication  cost. 

• We empirically demonstrate the effectiveness of the COM-Cooperative HRL algorithm
using a multi-agent taxi problem.

1.4  Outline 

The  remainder  of  this  thesis  is  organized  as  follows: 

Chapter 2: We present the foundational background material for the dissertation. We begin
by describing the reinforcement learning (RL) problem and formalizing the Markov
decision  process  (MDP)  and  semi-Markov  decision  process  (SMDP)  frameworks  under 
different  optimality  criteria.  We  also  review  some  of  the  key  ideas  and  solution  methods 
of  MDPs  and  SMDPs.  We  discuss  some  of  the  difficulties  of  solving  MDPs  for  problems 
with  large  state  spaces.  Then  we  briefly  review  the  historical  development  of  hierarchy  and 
temporal abstraction in artificial intelligence (AI), control theory, and RL. In doing so, we
especially emphasize hierarchical reinforcement learning (HRL) and the main concepts and
algorithms  in  this  framework.  Finally,  we  present  a  brief  overview  of  the  growing  field  of 
multi-agent  reinforcement  learning.  In  this  chapter,  we  also  introduce  the  notation  that  will 
be  used  in  this  dissertation. 

Chapter  3:  We  present  a  general  framework  for  hierarchical  reinforcement  learning  (HRL) 
which  is  used  in  the  algorithms  proposed  in  this  dissertation.  We  also  illustrate  the  basic 
concepts  of  HRL  such  as  policy  execution,  hierarchical  and  recursive  optimality,  and  value 
function  definitions  and  decompositions  in  this  chapter. 
Chapter 4: We present hierarchically optimal average reward RL (HAR) and recursively
optimal average reward RL (RAR) algorithms for both discrete and continuous time SMDP
models. We investigate the conditions under which the policy learned by the RAR algorithm
at each subtask is independent of the context in which it is executed and therefore
can  be  reused  by  other  hierarchies.  We  use  two  AGV  tasks  to  demonstrate  the  performance 
and  the  type  of  optimality  achieved  by  these  algorithms. 

Chapter 5: We first present a family of hierarchical policy gradient RL (HPGRL) algorithms
and compare their performance with hierarchical value function RL (VFRL) algorithms
in a simple taxi-fuel problem. We then show how learning can be accelerated
in  HPGRL  algorithms  by  using  both  value  function  and  policy  gradient  RL  formulations 
in  a  hierarchy,  and  propose  a  family  of  hierarchical  hybrid  algorithms.  We  empirically 
demonstrate  the  performance  of  a  hierarchical  hybrid  algorithm  using  a  continuous  state 
and  action  ship  steering  problem. 

Chapter  6:  We  investigate  the  use  of  hierarchical  reinforcement  learning  (HRL)  to  speed 
up  the  acquisition  of  cooperative  multi-agent  tasks.  We  first  extend  the  SMDP  model  to 
cooperative  multi-agent  domains  and  present  the  multi-agent  SMDP  (MSMDP)  model.  We 
use  this  model  and  present  a  hierarchical  cooperative  multi-agent  RL  framework.  We  then 
use this hierarchical cooperative multi-agent RL framework, and propose two hierarchical
cooperative multi-agent RL algorithms called Cooperative HRL and COM-Cooperative
HRL.  While  the  Cooperative  HRL  algorithm  assumes  that  communication  is  free,  in  the 
COM-Cooperative  HRL  algorithm,  agents  learn  both  action  and  communication  policies 
that  together  optimize  the  task  given  the  communication  cost.  The  effectiveness  of  the 
Cooperative  HRL  algorithm  is  empirically  demonstrated  using  a  large  four-agent  AGV 
scheduling  problem.  We  also  empirically  demonstrate  the  efficacy  of  the  COM-Cooperative 
HRL  algorithm  as  well  as  the  relation  between  the  communication  cost  and  the  learned 
communication policy using a multi-agent taxi problem.


Chapter  7:  We  summarize  the  dissertation  and  discuss  directions  for  future  research. 


Appendix:  We  define  a  table  of  the  symbols  used  in  this  dissertation. 
CHAPTER 2


BACKGROUND  AND  NOTATION 


In  this  chapter,  we  describe  the  reinforcement  learning  (RL)  problem  and  introduce  the 
Markov  decision  process  (MDP)  and  semi-Markov  decision  process  (SMDP)  formalisms 
under  different  optimality  criteria.  We  also  present  some  of  the  key  ideas  and  solution 
methods  of  MDPs  and  SMDPs.  Then  we  review  the  historical  development  of  hierarchy 
and  temporal  abstraction  in  artificial  intelligence  (AI),  control  theory,  and  RL.  In  this,  we 
especially  emphasize  hierarchical  reinforcement  learning  (HRL)  and  the  main  concepts  and 
algorithms  in  this  framework.  Finally,  we  present  a  brief  overview  of  the  growing  field  of 
multi-agent  reinforcement  learning.  In  doing  so,  we  also  introduce  the  notation  that  will  be 
used  in  the  remainder  of  this  dissertation. 

Throughout  this  chapter  we  present  the  standard  body  of  background  work  in  the  field. 
For a more comprehensive introduction to MDPs, SMDPs, and RL, readers may also refer to
standard texts such as (Howard, 1960, 1971; Puterman, 1994; Bertsekas, 1995; Bertsekas
and Tsitsiklis, 1996; Sutton and Barto, 1998) or the survey by Kaelbling et al. (1996).
Barto and Mahadevan (2003) provide a more detailed introduction to HRL.

2.1  Reinforcement  Learning 

Reinforcement learning (RL) (Sutton and Barto, 1998) refers to a collection of methods
that allow an agent (a system) to learn how to make good decisions by observing its
own behavior, and to improve its actions through a reinforcement mechanism. Many formal
specifications of this kind of problem have been developed over the last fifty years.
The most commonly used is the Markov decision process (MDP). An MDP assumes that the
agent has full access to the state of the world and that each of its actions takes a
single time step. Semi-Markov decision processes (SMDPs) relax the latter assumption
and allow actions that take several time steps. Finally, partially observable
Markov decision processes (POMDPs) relax the former assumption by allowing the agent
to receive observations that do not necessarily reveal the entire state of the environment.
When a problem is modeled using one of the above, the goal of an RL method is to find a
good (possibly optimal) policy for the model. We will cover MDPs and SMDPs in detail
in Sections 2.2 and 2.3. POMDPs will be presented more briefly, as the subject of partial
observability is almost (but not completely) orthogonal to the main contributions of this
dissertation.

2.2  Markov  Decision  Processes 

Markov decision processes (MDPs) (Howard, 1960; Puterman, 1994) are a model for
sequential decision making when outcomes are uncertain. There are many possible ways
of defining MDPs, and many of these definitions are equivalent up to small transformations
of the problem. One definition is that an MDP model $\mathcal{M}$ consists of five elements
$\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \mathcal{I} \rangle$ defined as follows:1

• $\mathcal{S}$: is the set of states of the world.

• $\mathcal{A}$: is the set of possible actions from which the agent (controller) may choose
at each decision epoch.

• $\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$: is the transition probability function, with $P(s'|s, a)$
being the probability of transition to state $s'$ when the agent takes action $a$ in state $s$.

• $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: is the reward function, with $r(s, a)$ being the reward that the
agent receives when it takes action $a$ in state $s$.

• $\mathcal{I} : \mathcal{S} \to [0, 1]$: is the initial state distribution.

1 In non-discrete settings (when the set of states $\mathcal{S}$ and the set of actions $\mathcal{A}$ are not discrete), many subtle
mathematical issues arise, which are not in the scope of this dissertation. For more details see (Howard, 1960;
Puterman, 1994).

The qualifier “Markov” is used because the transition probability and reward functions
depend on the past only through the current state of the system and the action selected by
the decision maker in that state. Since it may not be possible for the agent to take every
action at each state $s$, we define $\mathcal{A}_s \subseteq \mathcal{A}$ as the set of admissible actions in state $s$.
Events in an MDP proceed as follows. The agent begins in an initial state $s_0$ drawn from
the initial distribution $\mathcal{I}$. At each time $t$, the agent observes the state of the
environment $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}_{s_t}$, as a result of which the state of
the system transitions to some state $s_{t+1} \in \mathcal{S}$ drawn from the transition probability
function $P(s_{t+1}|s_t, a_t)$, and the agent receives reward $r(s_t, a_t)$.
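
The sequence of events just described can be sketched as a short simulation. The two-state MDP below, its action names, its probabilities, and its rewards are all invented for illustration and do not come from this dissertation; the uniformly random policy is likewise only a placeholder.

```python
import random

# Hypothetical two-state, two-action MDP; every number here is made up.
STATES = ["s0", "s1"]
ACTIONS = {"s0": ["stay", "go"], "s1": ["stay", "go"]}  # admissible actions A_s
# P[(s, a)] maps each successor state s' to P(s'|s, a).
P = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "go"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "go"): {"s0": 0.7, "s1": 0.3},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): -1.0, ("s1", "stay"): 2.0, ("s1", "go"): 0.0}
INITIAL = {"s0": 1.0, "s1": 0.0}  # initial state distribution


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def uniform_policy(s, rng):
    """Placeholder stationary stochastic policy: uniform over admissible actions."""
    return rng.choice(ACTIONS[s])


def rollout(policy, horizon, rng):
    """Return the list of (s_t, a_t, r(s_t, a_t)) triples seen over `horizon` steps."""
    s = sample(INITIAL, rng)              # s_0 drawn from the initial distribution
    trajectory = []
    for _ in range(horizon):
        a = policy(s, rng)                # a_t chosen among the admissible actions
        trajectory.append((s, a, R[(s, a)]))
        s = sample(P[(s, a)], rng)        # s_{t+1} drawn from P(.|s_t, a_t)
    return trajectory


trajectory = rollout(uniform_policy, horizon=5, rng=random.Random(0))
```

Each entry of `trajectory` records one decision epoch; the next state depends only on the current state and action, which is the Markov property in operational form.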

The method of specifying an agent’s behavior in an MDP is called a policy. A policy
can be stationary, in which case it is a stochastic mapping from states to actions, but it can
also be non-stationary and depend on other factors such as the agent’s memory or internal
state. A stationary policy, $\mu$, can be deterministic, in which case it is a mapping from
states to actions $\mu : \mathcal{S} \to \mathcal{A}$, or stochastic, in which case it is a probability distribution
over state-action pairs $\mu : \mathcal{S} \times \mathcal{A} \to [0, 1]$ with $\sum_{a \in \mathcal{A}_s} \mu(a|s) = 1$ for all $s \in \mathcal{S}$, where
$\mu(a|s)$ represents the probability that policy $\mu$ selects action $a$ in state $s$.

Now  the  question  arises  of  the  quality  of  a  given  policy.  There  are  many  ways  of 
defining  optimality,  but  typically  the  quality  or  value  of  a  policy  is  based  on  a  function 
of  the  future  rewards.  In  Sections  2.2.1,  2.2.2,  and  2.2.3,  we  examine  several  popular 
optimality  criteria  in  the  MDP  literature. 

2.2.1  Undiscounted  Reward  Markov  Decision  Processes 

In episodic tasks, the environment has one or more absorbing terminal states. All
transitions from an absorbing terminal state lead back into the same state with probability
1.0 and reward 0. Typically in this setting, the goal is to maximize the expected undiscounted
sum of rewards $\sum_{t=0}^{N-1} r(s_t, a_t)$, where $N$ is the number of time steps taken before reaching
an absorbing state. We usually consider only proper policies, that is, policies that
reach an absorbing terminal state with probability 1.0 (Bertsekas and Tsitsiklis, 1996).

In the infinite-horizon setting, where the agent may take an infinite number of steps, the
undiscounted sum of rewards can be infinite. To avoid this, discounted and average reward
optimality criteria are often used, which we describe in the next two sections.

2.2.2  Discounted  Reward  Markov  Decision  Processes 

In discounted reward MDPs, near-term rewards are weighted more than distant rewards.
In this setting, the agent’s goal is to maximize $\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)$. This sum is finite if
the discount factor satisfies $0 \le \gamma < 1$ and all rewards are bounded. Note that episodic
problems can be folded into this setting: if all policies are proper and we use a discount factor
of $\gamma = 1$, the undiscounted sum of rewards of an episodic task remains finite (Bertsekas
and Tsitsiklis, 1996; Sutton and Barto, 1998).
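
The finiteness claim can be checked with a geometric series. In the worked bound below, $r_{\max}$ is a symbol introduced here for illustration (an assumed upper bound on $|r(s, a)|$), and the numerical values are arbitrary.

```latex
\left| \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right|
  \le \sum_{t=0}^{\infty} \gamma^t r_{\max}
  = \frac{r_{\max}}{1 - \gamma},
\qquad \text{e.g. } \gamma = 0.9,\ r_{\max} = 1
  \;\Rightarrow\; \frac{1}{1 - 0.9} = 10 .
```

The bound grows without limit as $\gamma$ approaches 1, which is why the undiscounted case needs the separate assumption that all policies are proper.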

In the infinite-horizon discounted reward setting, the value function for a policy $\mu$,
$V^\mu : \mathcal{S} \to \mathbb{R}$, is a mapping from states to their values under policy $\mu$. The value of state $s$
under policy $\mu$ expresses the expected discounted sum of future rewards starting from state
$s$ and following policy $\mu$ thereafter. Formally, we define the value function of a policy as

$$V^\mu(s) = E\left[ r(s_0, \mu(s_0)) + \gamma r(s_1, \mu(s_1)) + \gamma^2 r(s_2, \mu(s_2)) + \dots \mid s_0 = s, \mu \right] = E\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, \mu \right]$$


We  can  relate  the  values  of  different  states  using  what  are  known  as  the  Bellman  equations 
(Bellman,  1957).  These  equations  relate  each  state  to  its  possible  successor  states. 


$$V^\mu(s) = \sum_{a \in \mathcal{A}_s} \mu(a|s) \left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^\mu(s') \right] \qquad (2.1)$$
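
As an illustration of how the Bellman equation for a fixed policy can be used computationally, the sketch below repeatedly applies its right-hand side until the values stop changing. The two-state MDP, the policy, and the tolerance are assumptions made up for this example; the function name `evaluate_policy` is likewise hypothetical.

```python
def evaluate_policy(states, actions, P, r, mu, gamma, tol=1e-10):
    """Iterate the Bellman equation for a fixed policy mu(a|s) until the values settle."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Average over actions of: immediate reward + discounted successor value.
            v_new = sum(
                mu[s][a] * (r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a in actions[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


# Hypothetical two-state MDP (all numbers invented for illustration).
states = [0, 1]
actions = {0: ["a", "b"], 1: ["a"]}
P = {(0, "a"): {1: 1.0}, (0, "b"): {0: 1.0}, (1, "a"): {1: 1.0}}
r = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 2.0}
mu = {0: {"a": 0.5, "b": 0.5}, 1: {"a": 1.0}}  # stochastic policy mu(a|s)

V = evaluate_policy(states, actions, P, r, mu, gamma=0.9)
```

For this example the equation can also be solved by hand: state 1 loops on itself with reward 2, so its value is 2 / (1 - 0.9) = 20, and substituting into the equation for state 0 gives 190/11.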

A policy $\mu$ is optimal if, for all states, its value is at least as high as the value of any other
policy. It is known (Blackwell, 1962) that there exists a deterministic optimal policy for
infinite-horizon discounted reward MDPs. The optimal policy $\mu^*$ is specified as

$$\mu^*(s) = \arg\max_{a \in \mathcal{A}_s} \left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^*(s') \right]$$


where  V*  is  the  optimal  value  function,  the  value  function  of  the  optimal  policy.  Bellman 
proved  that  the  optimal  value  function  is  the  solution  to  the  following  equation: 


$$V^*(s) = \max_{\mu} V^\mu(s) = \max_{a \in \mathcal{A}_s} \left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V^*(s') \right] \qquad (2.2)$$


Similarly, the action-value function of a policy $\mu$, $Q^\mu : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, is defined as a
mapping from state-action pairs to their values. The action-value function $Q^\mu(s, a)$ for a
policy $\mu$ is the expected sum of discounted future rewards for taking action $a$ in state $s$ and
then following policy $\mu$.

$$Q^\mu(s, a) = E\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a, \mu \right]$$

Note that $V^\mu(s) = Q^\mu(s, \mu(s))$. The Bellman equation for the action-value function $Q^\mu$
can be written as

$$Q^\mu(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) \sum_{a' \in \mathcal{A}_{s'}} \mu(a'|s') Q^\mu(s', a')$$

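and the optimal action-value function $Q^*$ is the solution to the Bellman optimality
equation for the action-value function, defined as follows:

$$Q^*(s, a) = \max_{\mu} Q^\mu(s, a) = r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) \max_{a' \in \mathcal{A}_{s'}} Q^*(s', a') \qquad (2.3)$$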

The Bellman Equations 2.2 and 2.3 are related by:

$$V^*(s) = \max_{a \in \mathcal{A}_s} Q^*(s, a)$$

An alternative way of defining the optimal value function is based on the Bellman operator
$T^*$ (Bertsekas, 1995) defined as

$$T^* V^\mu(s) = \max_{a \in \mathcal{A}_s} Q^\mu(s, a)$$

The optimal value function $V^*$ is the fixed point of $T^*$, that is, $V^* = T^* V^*$.

2.2.3  Average  Reward  Markov  Decision  Processes 

Discounted optimization is motivated by domains where reward can be interpreted as
money that can earn interest, or where there is a fixed probability that a run will be
terminated at any given time. However, many domains do not have either of these properties.
Discounting in such domains tends to sacrifice long-term rewards in favor of short-term
rewards. Moreover, in general, the discounted optimal policy depends on the choice of the
value of the discount factor $\gamma$. For instance, consider the MDP of Figure 2.1 from Schwartz
(1993). Here, any undiscounted reward method will clearly choose action $a_1$ in state $s_1$.
But for any $\gamma$ below a threshold of approximately 0.998, $Q^\mu(s_1, a_2) > Q^\mu(s_1, a_1)$ regardless
of policy $\mu$. In fact, given any $\gamma$, there is some value we can set for $r(s_1, a_2)$ which makes
the discounted criterion favor action $a_2$ over action $a_1$.

Figure 2.1. An MDP on which discounted and undiscounted measures may disagree.

It is true that for any finite MDP (an MDP with finite state and action spaces) there is
some sufficiently large $\gamma$ for which the discounted and undiscounted measures agree.
However, proper choice of such a $\gamma$ requires detailed knowledge of the domain, knowledge
that we do not want to presuppose. Even with such knowledge, a parameter such as $\gamma$ that
needs to be tailored to suit individual domains is clearly undesirable. Therefore, the agent
may prefer to compare policies on the basis of their average expected reward instead of
their expected discounted reward. The aim of the average reward MDP is to compute policies
that yield the highest expected payoff per step. The average reward or gain associated
with a particular policy $\mu$, $g^\mu$, is defined as

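$$g^\mu(s) = \lim_{N \to \infty} \frac{1}{N} E\left[ \sum_{t=0}^{N-1} r(s_t, \mu(s_t)) \mid s_0 = s, \mu \right] \qquad (2.4)$$

when the state space of the MDP, $\mathcal{S}$, is finite or countable. Equation 2.4 can be written as

$$g^\mu(s) = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} (P^\mu)^t r(s, \mu(s)) = \bar{P}^\mu r(s, \mu(s)) \qquad (2.5)$$

where $P^\mu$ and $\bar{P}^\mu = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} (P^\mu)^t$ are the transition probability matrix and the
limiting matrix of policy $\mu$ respectively.2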

A key observation that greatly simplifies the design of the average reward algorithms
is that for unichain MDPs,3 the average reward of any policy is state independent, that is
$g^\mu(s) = g^\mu, \forall s \in \mathcal{S}$. From now on in this section we assume that MDPs are unichain.

2 The limiting matrix $\bar{P}$ satisfies the equality $P\bar{P} = \bar{P}$.

3 MDPs in which every stationary policy gives rise to a Markov chain with a single recurrent class.

In average reward MDPs, a policy $\mu$ is measured using a different value function, namely
the average-adjusted sum of rewards earned following that policy.4

$$H^\mu(s) = \lim_{N \to \infty} E\left[ \sum_{t=0}^{N-1} \left[ r(s_t, \mu(s_t)) - g^\mu \right] \mid s_0 = s, \mu \right]$$

The term $H^\mu$ is usually referred to as the average-adjusted value function. Furthermore,
the average-adjusted value function satisfies the Bellman equation

$$H^\mu(s) + g^\mu = r(s, \mu(s)) + \sum_{s' \in \mathcal{S}} P(s'|s, \mu(s)) H^\mu(s')$$

Similarly, the average-adjusted action-value function for a policy $\mu$, $L^\mu$, is defined, and
it satisfies the Bellman equation

$$L^\mu(s, a) + g^\mu = r(s, a) + \sum_{s' \in \mathcal{S}} P(s'|s, a) L^\mu(s', \mu(s'))$$

We define a gain-optimal policy $\mu^*$ as one that has the maximum average reward over all
policies, that is $g^* \ge g^\mu$ for all policies $\mu$. The gain-optimal policy satisfies the following
Bellman optimality equations for the average-adjusted value function and average-adjusted
action-value function (Bertsekas, 1995).

$$H^*(s) + g^* = \max_{a \in \mathcal{A}_s} \left[ r(s, a) + \sum_{s' \in \mathcal{S}} P(s'|s, a) H^*(s') \right] \qquad (2.6)$$

$$L^*(s, a) + g^* = r(s, a) + \sum_{s' \in \mathcal{S}} P(s'|s, a) \max_{a' \in \mathcal{A}_{s'}} L^*(s', a') \qquad (2.7)$$

It is proved (Howard, 1960; Puterman, 1994) that for any unichain MDP, there exist a $g^*$ and
a function $H^*$ over $\mathcal{S}$ that satisfy Equation 2.6 (or a function $L^*$ over $\mathcal{S} \times \mathcal{A}$ that satisfies
Equation 2.7). Further, $g^*$, $H^*$, and $L^*$ are the gain, average-adjusted value function, and
average-adjusted action-value function of the gain-optimal policy $\mu^*$.

4 This limit assumes that all policies are aperiodic. For periodic policies, it changes to the Cesaro limit
$H^\mu(s) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} E\left[ \sum_{t=0}^{k} \left[ r(s_t, \mu(s_t)) - g^\mu \right] \mid s_0 = s, \mu \right]$ (Puterman, 1994).
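
One standard way to solve the average reward optimality equation for the value function numerically is relative value iteration, sketched below. This is not an algorithm proposed in this dissertation; the MDP, the reference state, and the tolerance are assumptions chosen for illustration. Every action in the example reaches both states with positive probability, so every stationary policy induces a single recurrent, aperiodic chain.

```python
def relative_value_iteration(states, actions, P, r, ref, tol=1e-12, max_iter=10000):
    """Approximate g* and H* by repeated backups, pinning H[ref] to zero."""

    def q(H, s, a):
        # Undiscounted one-step backup: r(s, a) + sum_s' P(s'|s, a) H(s').
        return r[(s, a)] + sum(p * H[s2] for s2, p in P[(s, a)].items())

    H = {s: 0.0 for s in states}
    g = 0.0
    for _ in range(max_iter):
        backup = {s: max(q(H, s, a) for a in actions[s]) for s in states}
        g = backup[ref]                       # because H[ref] = 0, this estimates the gain
        H_new = {s: backup[s] - g for s in states}
        converged = max(abs(H_new[s] - H[s]) for s in states) < tol
        H = H_new
        if converged:
            break
    policy = {s: max(actions[s], key=lambda a: q(H, s, a)) for s in states}
    return g, H, policy


# Hypothetical unichain MDP (all numbers invented for illustration).
states = [0, 1]
actions = {0: ["a", "b"], 1: ["a", "b"]}
P = {
    (0, "a"): {0: 0.5, 1: 0.5},
    (0, "b"): {0: 0.9, 1: 0.1},
    (1, "a"): {0: 0.5, 1: 0.5},
    (1, "b"): {0: 0.1, 1: 0.9},
}
r = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 2.0, (1, "b"): 3.0}

gain, H, policy = relative_value_iteration(states, actions, P, r, ref=0)
```

For this example the optimality equation can be solved by hand: with $H(0) = 0$, the pair of equations $g = 1 + 0.5\,H(1)$ and $H(1) + g = 3 + 0.9\,H(1)$ gives $H(1) = 10/3$ and $g = 8/3$, which the iteration reproduces.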

2.2.4  Solution  Methods  for  MDPs 

Now that we have defined the MDP model, the next task is to solve it, i.e., to find an
optimal policy and/or the optimal value function.5 There is a variety of methods for achieving
this. Some methods require knowing the transition probability and reward functions
and are performed without access to an environment; these are considered offline algorithms.
These are the standard dynamic programming (DP) algorithms from the field of
operations research. Having the model allows the simulation of the domain so as to do
planning to find the optimal value function and/or an optimal policy without interacting
directly with the environment. Other methods work without assuming prior knowledge of
the model and operate by learning through experience in the environment; these are called
online algorithms.

Since a value function (or an action-value function) defines a policy in an MDP, one
approach to finding the optimal policy is to compute the optimal value (action-value) function
first, and then extract the optimal policy from it. We call the algorithms utilizing this
approach value function algorithms. Another approach is to directly find the optimal policy.
The methods using this approach are called policy search methods. In Sections 2.2.4.1 and
2.2.4.2, we present a brief overview of these two approaches to solving an MDP model.

2.2.4.1  Value  Function  Solution  Methods  for  MDPs 

Value function (VF) methods attempt to find the optimal value (action-value) function
and then extract an optimal policy from it. These algorithms have been extensively studied
in the machine learning literature (Bertsekas and Tsitsiklis, 1996; Sutton and Barto,
1998) and have yielded some remarkable empirical successes in a number of different domains,
including learning to play checkers (Samuel, 1959), backgammon (Tesauro, 1994),
job-shop scheduling (Zhang and Dietterich, 1995), dynamic channel allocation (Singh and
Bertsekas, 1996), and elevator scheduling (Crites and Barto, 1998). We now briefly review
some standard VF algorithms.

5 What we really mean by an optimal policy in this section is a reasonably good policy, since in any
real-world AI problem it is not possible to even imagine finding optimal policies.

If  the  model  is  known,  then  Equation  2.2  defines  a  system  of  equations,  the  solution  to 
which  yields  the  optimal  value  function.  These  equations  may  either  be  solved  directly  via 
solving  a  related  linear  program  (e.g.,  Gordon  (1999);  de  Farias  (2002)),  or  by  iteratively 
performing  the  update 


$$V(s) \leftarrow \max_{a \in \mathcal{A}_s} \left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V(s') \right]$$

until  it  converges.  The  latter  of  these  is  called  value  iteration  (Bertsekas  and  Tsitsiklis, 
1996;  Sutton  and  Barto,  1998),  which  is  a  DP-based  algorithm. 
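
A minimal sketch of value iteration on a tabular model follows. The deterministic two-state MDP, the tolerance, and the in-place (state-by-state) update order are assumptions made for this illustration rather than details taken from the cited texts.

```python
def value_iteration(states, actions, P, r, gamma, tol=1e-10):
    """Repeat the max-backup until no state value changes by more than tol."""
    V = {s: 0.0 for s in states}

    def backup(s, a):
        # One-step lookahead: r(s, a) + gamma * sum_s' P(s'|s, a) V(s').
        return r[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

    while True:
        delta = 0.0
        for s in states:
            v_new = max(backup(s, a) for a in actions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Extract a greedy policy from the converged value function.
    greedy = {s: max(actions[s], key=lambda a: backup(s, a)) for s in states}
    return V, greedy


# Hypothetical deterministic two-state MDP (all numbers invented for illustration).
states = [0, 1]
actions = {0: ["a", "b"], 1: ["a", "c"]}
P = {(0, "a"): {1: 1.0}, (0, "b"): {0: 1.0}, (1, "a"): {1: 1.0}, (1, "c"): {0: 1.0}}
r = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 2.0, (1, "c"): 3.0}

V, greedy = value_iteration(states, actions, P, r, gamma=0.9)
```

In this example the best behavior alternates between the two states, and solving the two resulting linear equations by hand gives values of 370/19 and 390/19 for states 0 and 1.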

Another standard DP-based algorithm is policy iteration (Bertsekas and Tsitsiklis,
1996; Sutton and Barto, 1998). It uses a policy $\mu$ and its estimated value function $V$,
and iteratively updates $\mu$ according to

$$\mu(s) = \arg\max_{a \in \mathcal{A}_s} \left[ r(s, a) + \gamma \sum_{s' \in \mathcal{S}} P(s'|s, a) V(s') \right]$$

and updates $V$ to be the value function $V^\mu$ for policy $\mu$ by solving the system of linear
equations given by Equation 2.1.

Other instances of offline VF algorithms are asynchronous value iteration and
asynchronous policy iteration (Bertsekas and Tsitsiklis, 1996; Sutton and Barto, 1998).

If the agent does not know the model of the domain, we may first try to interact with the
environment to learn a model, which is then used to compute optimal policies (e.g., Dyna
(Sutton, 1991) and prioritized sweeping (Moore and Atkeson, 1993)). This is known as the
model-based approach. Alternatively, we may try to learn the value (action-value) function
directly without explicitly learning a model. This approach is referred to as model-free,
in that the agent does not need to learn the transition probabilities. Most of the model-free
VF algorithms are instances of temporal difference (TD) learning (Sutton, 1988),
where the agent updates estimates of the value (action-value) function based in part on
other estimates, without waiting for the true value. Two of the more popular TD methods are
SARSA (Rummery and Niranjan, 1994) and Q-learning (Watkins, 1989).

The SARSA algorithm performs the following update upon seeing a transition from
state $s$ to $s'$ when taking action $a$, where $a'$ is the action the agent then selects in $s'$:

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ r(s, a) + \gamma Q(s', a') \right]$$

where $\alpha$ is called the learning rate parameter. SARSA causes the action-value function $Q$ to
converge to the optimal action-value function if a GLIE (Greedy in the Limit with Infinite
Exploration) policy is used (Singh et al., 2000a). SARSA is known as an on-policy method,
in that it learns about the policy that it executes.
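
The on-policy character of SARSA shows up in code as the fact that the action used in the update target is the one the behavior policy actually selects in the next state. The sketch below assumes a tabular action-value function stored in a dictionary and an epsilon-greedy behavior policy; the table entries, state names, and parameter values are invented for illustration.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, reward, s_next, a_next, alpha, gamma):
    """One SARSA backup; the target uses the action a_next actually chosen in s_next."""
    target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


# Hypothetical table and transition, chosen only to make the arithmetic easy to follow.
Q = {("s", "left"): 1.0, ("s", "right"): 0.0, ("s2", "left"): 2.0, ("s2", "right"): 4.0}
rng = random.Random(0)
# With epsilon = 0 the behavior policy is purely greedy, so it picks "right" in s2.
a_next = epsilon_greedy(Q, "s2", ["left", "right"], epsilon=0.0, rng=rng)
new_value = sarsa_update(Q, "s", "left", reward=0.5, s_next="s2", a_next=a_next,
                         alpha=0.1, gamma=0.9)
```

Here the target is 0.5 + 0.9 × 4.0 = 4.1, so the entry for ("s", "left") moves from 1.0 to 0.9 × 1.0 + 0.1 × 4.1 = 1.31.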

The Q-learning algorithm performs the following update when the agent takes action $a$
in state $s$ and transitions to state $s'$:

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a' \in \mathcal{A}_{s'}} Q(s', a') \right]$$

It can be shown that Q-learning converges with probability 1.0 if the agent uses an
exploration policy that tries every action in every state infinitely often and the learning
rate $\alpha$ satisfies the usual stochastic approximation conditions ($\sum_t \alpha_t = \infty$ and
$\sum_t \alpha_t^2 < \infty$) (Jaakkola et al., 1994; Bertsekas and Tsitsiklis, 1996). Q-learning is known
as an off-policy algorithm, meaning that the agent does not have to follow the policy for
which it is learning a value function. This is advantageous in that a wider set of exploration
methods is allowed.
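
The off-policy property can be seen in a small experiment: the agent below behaves uniformly at random, yet the table it learns describes the greedy policy. The deterministic two-state MDP and all parameter values are assumptions made for this sketch. A constant learning rate is used for simplicity, which is adequate here only because transitions and rewards are deterministic; it does not satisfy the decaying step-size conditions required in general.

```python
import random

# Hypothetical deterministic two-state MDP (all numbers invented for illustration).
ACTIONS = {0: ["a", "b"], 1: ["a", "c"]}
NEXT = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "c"): 0}
REWARD = {(0, "a"): 1.0, (0, "b"): 0.0, (1, "a"): 2.0, (1, "c"): 3.0}


def q_learning(steps, alpha, gamma, rng):
    """Tabular Q-learning driven by a uniformly random behavior policy."""
    Q = {sa: 0.0 for sa in NEXT}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS[s])                       # exploratory action
        s_next = NEXT[(s, a)]
        best_next = max(Q[(s_next, a2)] for a2 in ACTIONS[s_next])
        # The target uses the max over next actions, not the action taken next.
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (REWARD[(s, a)] + gamma * best_next)
        s = s_next
    return Q


Q = q_learning(steps=20000, alpha=0.5, gamma=0.9, rng=random.Random(0))
greedy = {s: max(ACTIONS[s], key=lambda a: Q[(s, a)]) for s in ACTIONS}
```

Although the agent never follows it, the greedy policy read off the table alternates between the two states, which is the optimal behavior for this example under a discount factor of 0.9.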

Although most of the VF algorithms have focused on the discounted setting, average
reward VF methods have also been well studied. An average reward VF method is
an undiscounted infinite-horizon method for finding gain-optimal policies of an MDP
(Mahadevan, 1996). It is generally appropriate in modeling cyclical control and optimization
tasks, such as queuing, scheduling, and flexible manufacturing (Gershwin, 1994;
Puterman, 1994). Several different types of average reward VF algorithms have been developed,
including offline algorithms such as (Bertsekas, 1998), model-based online methods such
as (Tadepalli and Ok, 1998), discrete-time model-free online algorithms (Schwartz, 1993;
Mahadevan, 1996; Tadepalli and Ok, 1996; Abounadi et al., 2001), and continuous-time
model-free online algorithms (Mahadevan et al., 1997b; Wang and Mahadevan, 1999).

The discussion so far assumes that the state space $\mathcal{S}$ is sufficiently small that $V$ can be
stored explicitly as a table, with one entry for each state. For larger MDPs, these methods
can be intractable. Specifically, in many problems, the number of states grows exponentially
in the number of state variables. Similarly, if we apply grid-based discretization to
an $n$-dimensional continuous state space to reduce the problem, we again end up with a
number of discretized states that is exponential in $n$. Bellman called this problem the curse
of dimensionality (Bellman, 1957), and it makes the straightforward application of RL
algorithms impractical even for many moderate-dimensional problems.

Thus, in domains with large or infinite state spaces, one looks for approximation techniques
that are based on a parametric representation of the value function, rather than an exact
representation. A few examples of previous work proposing various approaches for doing
so in different settings include (Van-Roy, 1998; Gordon, 1999; Koller and Parr, 2000;
Guestrin et al., 2001; Dietterich and Wang, 2002; de Farias, 2002), and this topic remains
an area of active research. The approximation methods have had some prominent empirical
successes, as mentioned at the beginning of this section. Despite numerous successes,
the application of VF methods becomes problematic in domains with large or infinite state
spaces. This is mainly because most algorithms for parametrically approximating value
functions suffer from the following theoretical flaw: the performance of the policy derived
from the approximate value function is not guaranteed to improve on each iteration, and in
fact can be worse than the policy in the previous iteration. This can happen even when the
chosen parametric class contains a value function whose derived policy is optimal (Baxter
and Bartlett, 2001). Additionally, VF methods become problematic when the state is only
partially observable, because most methods for value function estimation critically rely on
the Markov property. In the next section, we describe an alternative approach to VF
methods that addresses some of the above issues, together with the problems that may arise
when this alternative is employed in complex domains.

2.2.4.2  Policy  Search  Solution  Methods  for  MDPs 

An alternative approach that circumvents the problems of VF methods mentioned at the
end of Section 2.2.4.1 is to directly search in the space of policies. The methods using this
approach to solve an MDP are known as policy search (PS) methods.6

PS methods have received much recent attention as a means to solve problems with large
or infinite state spaces, and problems with partially observable states. The motivation for
this is threefold. 1) For many MDPs, the value and action-value functions can be difficult
to approximate, even though there may be simple and compactly representable policies that
perform very well. Indeed, the existence of a good, compact representation of an
action-value function implies the existence of a good, compact representation of a policy, because
an action-value function defines a policy. In contrast, there is no guarantee that the existence
of a good, compact representation of a policy implies a good, compact representation
of an action-value function. 2) Because PS algorithms start with a parameterized policy, it is
relatively simple to choose a policy which incorporates prior knowledge via an appropriate
choice of the parametric form of the policy. The use of prior knowledge in VF algorithms
is not as easily realized. Finally, 3) many real domains are only partially observable, and
VF algorithms are known to be difficult to implement in such domains. Conversely, PS
algorithms have been shown to work more effectively in partially observable domains. We
might use a class of policies that depend only on the observables. This results in a class of
memoryless (reactive) policies that can be applied to POMDP models (Williams and Singh,
1999). We can also introduce memory variables into the process state, and define limited
memory policies (Meuleau et al., 1999). This permits belief state tracking, in which the agent
uses past and present observations to estimate the true state.

6 Policy iteration can also be considered a policy search (PS) method. However, since it uses a value
function, we categorize it as a value function (VF) method.

Of course, while PS methods provide a powerful tool for solving many problems in RL
and control, there are also settings in which VF algorithms may be preferred. For instance,
explicitly searching in a policy space for a good policy may be computationally expensive
and more prone to local optima than certain VF methods. So, if there is reason to believe
that the value function can be easily approximated, then the VF approach would perhaps
be the method of choice. Moreover, if we do not have prior knowledge about a likely form
of a good policy, then one may instead use a VF algorithm.

A well-known class of PS algorithms is policy gradient (PG) algorithms. In these
methods, we usually consider a class of parameterized stochastic policies, estimate the
gradient of a performance function (e.g., average reward over time or weighted reward-to-go)
with respect to the policy parameters, and then improve the policy by adjusting the parameters
in the direction of the gradient (Williams, 1992; Kimura et al., 1995; Marbach, 1998;
Baxter et al., 2001). This approach has a long history in operations research, statistics,
and control, forming the basis of perturbation analysis of discrete event dynamic systems
(Ho and Cao, 1991; Cassandras and Lafortune, 1999). In addition to the pros and cons of
PS methods mentioned above, PG algorithms are theoretically guaranteed to converge, but
only to locally optimal policies, whereas VF algorithms can find globally optimal solutions.
In practice, however, it is usually not feasible to converge to globally optimal solutions in
large domains in any case. PG methods also usually suffer from the following problems.
1) They may require an amount of sampling (number of steps) that is exponential in the
number of states or in the horizon time. 2) They are limited to stochastic policies. In some
domains, it seems very undesirable to add extra randomness to an already stochastic problem
by forcing our policy to randomly choose its actions. 3) They generally sample from the MDP
once to take a small uphill step and then throw away the data.
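
The basic policy gradient loop (parameterize a stochastic policy, compute the gradient of a performance measure, step uphill) can be illustrated on a deliberately tiny problem. The sketch below assumes a single state with two actions and a softmax policy, and it uses the exact gradient of the expected reward rather than a sampled estimate, so it omits the estimation step that the cited algorithms are mainly concerned with. The rewards, step size, and iteration count are invented.

```python
import math


def softmax(theta):
    """Softmax policy over actions, computed stably."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def expected_reward(theta, rewards):
    """Performance measure J(theta) = sum_a pi_theta(a) r(a)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def gradient(theta, rewards):
    """Exact gradient for a softmax policy: dJ/dtheta_k = pi_k * (r_k - J)."""
    pi = softmax(theta)
    J = sum(p * r for p, r in zip(pi, rewards))
    return [p * (r - J) for p, r in zip(pi, rewards)]


rewards = [1.0, 2.0]      # hypothetical one-state problem with two actions
theta = [0.0, 0.0]        # initial policy parameters: uniform policy
step_size = 0.5
history = [expected_reward(theta, rewards)]
for _ in range(200):
    grad = gradient(theta, rewards)
    theta = [t + step_size * g for t, g in zip(theta, grad)]   # gradient ascent step
    history.append(expected_reward(theta, rewards))
pi = softmax(theta)
```

The policy stays stochastic throughout and only approaches the deterministic choice of the better action in the limit, which illustrates the second limitation listed above.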

One way to address some of the issues of using PG methods is to assume that the
learning algorithm has access to the MDP via a generative model or a simulator (Kearns
et al., 2000; Ng and Jordan, 2000; Ng, 2003). Ng et al. (2004) recently showed a very
impressive application of this type of PS method to autonomous helicopter flight.

2.3  Semi-Markov  Decision  Processes 

Semi-Markov decision processes (SMDPs) (Howard, 1971; Puterman, 1994) extend the
MDP model by allowing actions that take multiple time steps to complete. The action
duration can depend on the transition that is made.7 The state of the system may change
continually between actions, unlike MDPs where state changes are only due to actions. Thus,
SMDPs have become the preferred language for modeling temporally extended actions
(Mahadevan et al., 1997a), which makes them very appealing in the context of hierarchical
reinforcement learning, as we will see in Section 2.4.3.

An SMDP is defined as a five tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \mathcal{I} \rangle$. All components are defined as in
an MDP except the transition probability function and the reward function. The transition
probability function $\mathcal{P}$ now takes the duration of the actions into account. The transition
probability function $\mathcal{P} : \mathcal{S} \times \mathbb{N} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ is a multi-step transition probability
function, with $P(s', N|s, a)$ denoting the probability that action $a$ will cause the system to
transition from state $s$ to state $s'$ in $N$ time steps. This transition is at decision epochs only.
Basically, the SMDP model represents snapshots of the system at decision points, whereas
the so-called natural process describes the evolution of the system over all times. If we
marginalize $P(s', N|s, a)$ over $N$, we obtain $m(s'|s, a)$, the transition probability for
the embedded MDP. The term $m(s'|s, a)$ denotes the probability that the SMDP occupies
state $s'$ at the next decision epoch, given that the decision maker chooses action $a$ in state $s$
at the current decision epoch. The key difference in the reward function for SMDPs is that
the rewards can accumulate over the entire duration of an action. As a result, the SMDP reward
for taking an action in a state depends on the evolution of the system during the execution
of the action. Formally, the SMDP reward is modeled as a function $\mathcal{R} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$,
with $r(s, a)$ representing the expected total reward between two decision epochs, given that
the system occupies state $s$ at the first decision epoch and the agent chooses action $a$. This
expected reward contains all necessary information about the reward to analyze the SMDP
model.

7 We are thus dealing with discrete-time SMDPs. Continuous-time SMDPs typically allow arbitrary
continuous action durations.

For each transition in an SMDP, the expected number of time steps until the next decision
epoch is defined as

\[ y(s, a) = E[N \mid s, a] = \sum_{N \in \mathbb{N}} N \sum_{s' \in \mathcal{S}} P(s', N \mid s, a) \]
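The two quantities derived from the multi-step transition function, the embedded-MDP transition probability $m(s' \mid s, a)$ and the expected duration $y(s, a)$, can be illustrated with a minimal sketch. The array layout `P[s, a, s_next, n]`, the toy numbers, and the finite bound on the durations are assumptions made only for this example.

```python
import numpy as np

# Hypothetical toy SMDP: 2 states, 1 action, durations N in {1, 2, 3}.
# P[s, a, s_next, n] stores P(s_next, N = n + 1 | s, a); the values are made up.
P = np.zeros((2, 1, 2, 3))
P[0, 0, 0, 0] = 0.5  # from state 0: stay in state 0 after 1 time step
P[0, 0, 1, 1] = 0.5  # from state 0: move to state 1 after 2 time steps
P[1, 0, 0, 2] = 1.0  # from state 1: return to state 0 after 3 time steps

durations = np.arange(1, P.shape[-1] + 1)  # N = 1, 2, 3

# Marginalize over N to obtain the embedded-MDP transition m(s' | s, a).
m = P.sum(axis=-1)

# Expected duration y(s, a) = sum_N N * sum_{s'} P(s', N | s, a).
y = (P.sum(axis=2) * durations).sum(axis=-1)
```

Each row of `m` sums to one, and `y[s, a]` is the mean number of time steps until the next decision epoch.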

The  notions  of  policies  and  the  various  forms  of  optimality  are  the  same  for  SMDPs  as 
for  MDPs.  In  infinite-horizon  SMDPs,  our  goal  is  still  to  find  a  policy  that  maximizes  either 
the  expected  discounted  reward  or  the  average  expected  reward.  These  two  optimality 
criteria  for  an  SMDP  model  will  be  discussed  in  sections  2.3.1  and  2.3.2. 

2.3.1  Discounted  Reward  Semi-Markov  Decision  Processes 

Recall that for a discounted MDP model, we expressed the expected value for following
a policy as $E\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, \mu(s_t))\right]$. In a discounted SMDP, because actions can take
variable amounts of time, the value of a state $s$ under a policy $\mu$ is defined as follows:

\[ V^{\mu}(s) = E\left[ r(s_0, \mu(s_0)) + \gamma^{N_0} r(s_1, \mu(s_1)) + \gamma^{N_0 + N_1} r(s_2, \mu(s_2)) + \dots \mid s_0 = s, \mu \right] \]
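To make the role of the durations concrete, the following sketch accumulates the quantity inside the expectation for a single sampled trajectory. The function name and the sample numbers are illustrative assumptions; each reward entry stands for the total reward collected between two consecutive decision epochs.

```python
def smdp_discounted_return(rewards, durations, gamma):
    """Return r_0 + gamma^{N_0} r_1 + gamma^{N_0 + N_1} r_2 + ... for one trajectory."""
    total = 0.0
    elapsed = 0  # time steps elapsed before the current decision epoch
    for reward, n_steps in zip(rewards, durations):
        total += (gamma ** elapsed) * reward
        elapsed += n_steps
    return total


# Three decisions whose actions lasted 2, 3 and 1 time steps (made-up values).
value = smdp_discounted_return([1.0, 2.0, 4.0], [2, 3, 1], gamma=0.5)
```

Averaging this quantity over many trajectories generated by a policy would give a Monte Carlo estimate of its value at the start state.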
Now we can express the Bellman equations for discounted SMDPs as

\[ V^{\mu}(s) = r(s, \mu(s)) + \sum_{s' \in \mathcal{S}, N \in \mathbb{N}} \gamma^{N} P(s', N \mid s, \mu(s)) V^{\mu}(s') \]

\[ Q^{\mu}(s, a) = r(s, a) + \sum_{s' \in \mathcal{S}, N \in \mathbb{N}} \gamma^{N} P(s', N \mid s, a) Q^{\mu}(s', \mu(s')) \]

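Similarly, we can write the Bellman optimality equations defining the optimal value function
and optimal action-value function as

\[ V^{*}(s) = \max_{a \in \mathcal{A}} \left[ r(s, a) + \sum_{s' \in \mathcal{S}, N \in \mathbb{N}} \gamma^{N} P(s', N \mid s, a) V^{*}(s') \right] \]

\[ Q^{*}(s, a) = r(s, a) + \sum_{s' \in \mathcal{S}, N \in \mathbb{N}} \gamma^{N} P(s', N \mid s, a) \max_{a' \in \mathcal{A}} Q^{*}(s', a') \]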
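The Bellman optimality equations above can be turned into a simple value iteration. The sketch below is one possible tabular version under stated assumptions: the array layout `P[s, a, s_next, n]`, the bounded set of durations, the stopping rule, and the toy problem are choices made for this example, not part of the original text.

```python
import numpy as np


def smdp_value_iteration(P, r, gamma, tol=1e-10, max_iter=10_000):
    """Value iteration for a discounted SMDP.

    P[s, a, s_next, n] = P(s_next, N = n + 1 | s, a); r[s, a] = expected reward.
    """
    n_states, _, _, n_durations = P.shape
    discounts = gamma ** np.arange(1, n_durations + 1)
    # F[s, a, s_next] = sum_N gamma^N P(s_next, N | s, a)
    F = (P * discounts).sum(axis=-1)
    V = np.zeros(n_states)
    for _ in range(max_iter):
        V_new = (r + F @ V).max(axis=1)  # Bellman optimality backup
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, r + F @ V


# Toy problem: one state, two actions. Action 0 lasts 1 step and pays 1.0;
# action 1 lasts 2 steps and pays 2.5 (made-up values).
P = np.zeros((1, 2, 1, 2))
P[0, 0, 0, 0] = 1.0
P[0, 1, 0, 1] = 1.0
r = np.array([[1.0, 2.5]])
V, Q = smdp_value_iteration(P, r, gamma=0.5)
```

In this toy problem the longer action is preferred because its larger reward outweighs the heavier discount $\gamma^{2}$ applied to the successor value.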


2.3.2  Average  Reward  Semi-Markov  Decision  Processes 

The theory of infinite-horizon SMDPs with the average reward criterion is more complex
than that for discounted models (Howard, 1971; Puterman, 1994). To simplify exposition,
we consider only unichain SMDPs. Under this assumption, the gain of any policy is
state independent, as in the average reward MDP model.

The average expected reward or gain for a policy $\mu$, $g^{\mu}$, can be defined by taking the
ratio of the expected total reward and the expected total number of time steps.

\[ g^{\mu} = \liminf_{n \to \infty} \frac{E\left[ \sum_{t=0}^{n-1} r(s_t, \mu(s_t)) \mid s_0 = s, \mu \right]}{E\left[ \sum_{t=0}^{n-1} N_t \mid s_0 = s, \mu \right]} \qquad (2.8) \]

where $N_t$ is the total number of time steps until the next decision epoch, when the agent takes
action $\mu(s_t)$ in state $s_t$. When the state space of the SMDP, $\mathcal{S}$, is finite or countable,
Equation 2.8 can be written as8

\[ g^{\mu} = \frac{\bar{m}^{\mu} r(s, \mu(s))}{\bar{m}^{\mu} y(s, \mu(s))} \qquad (2.9) \]


where $m^{\mu}$ and $\bar{m}^{\mu} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} (m^{\mu})^{t}$ are the transition probability matrix and the
limiting matrix of the embedded Markov chain for policy $\mu$, respectively.9
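For a small unichain SMDP, Equation 2.9 can be evaluated numerically. The sketch below relies on the fact that, under the unichain assumption, every row of the limiting matrix equals the stationary distribution of the embedded chain, so the gain is a ratio of two stationary averages. The function name, the least-squares solve, and the toy numbers are assumptions made for this example.

```python
import numpy as np


def smdp_gain(m_mu, r_mu, y_mu):
    """Gain of a fixed policy in a unichain SMDP.

    m_mu[s, s_next]: embedded-chain transition matrix under the policy.
    r_mu[s], y_mu[s]: expected reward and duration of the policy's action in s.
    """
    n = m_mu.shape[0]
    # The stationary distribution pi solves pi m = pi with sum(pi) = 1.
    A = np.vstack([m_mu.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi @ r_mu) / float(pi @ y_mu)


# Made-up two-state example.
m_mu = np.array([[0.5, 0.5], [1.0, 0.0]])
r_mu = np.array([1.0, 4.0])
y_mu = np.array([1.5, 3.0])
g = smdp_gain(m_mu, r_mu, y_mu)
```

Here the stationary distribution is (2/3, 1/3), so the policy collects 2 units of reward per 2 time steps on average.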

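The Bellman equations for the average-adjusted value function $H^{\mu}$ and the average-adjusted
action-value function $L^{\mu}$ can be written as

\[ H^{\mu}(s) = r(s, \mu(s)) - g^{\mu} y(s, \mu(s)) + \sum_{s' \in \mathcal{S}, N \in \mathbb{N}} P(s', N \mid s, \mu(s)) H^{\mu}(s') \]

\[ L^{\mu}(s, a) = r(s, a) - g^{\mu} y(s, a) + \sum_{s' \in \mathcal{S}, N \in \mathbb{N}} P(s', N \mid s, a) L^{\mu}(s', \mu(s')) \]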
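As a consistency check, the Bellman equation for $H^{\mu}$ can be related back to Equation 2.9. The short derivation below is a sketch in vector notation; the symbols $r^{\mu}$ and $y^{\mu}$ (vectors with entries $r(s, \mu(s))$ and $y(s, \mu(s))$) are shorthand introduced here, and the identity $\bar{m}^{\mu} m^{\mu} = \bar{m}^{\mu}$ is the standard companion of the equality stated in footnote 9.

```latex
% No discount factor appears, so the sum over N can be carried out first:
%   \sum_{N} P(s', N \mid s, \mu(s)) = m(s' \mid s, \mu(s)).
\begin{align*}
H^{\mu} &= r^{\mu} - g^{\mu} y^{\mu} + m^{\mu} H^{\mu} \\
\bar{m}^{\mu} H^{\mu} &= \bar{m}^{\mu} r^{\mu} - g^{\mu}\, \bar{m}^{\mu} y^{\mu}
                         + \bar{m}^{\mu} m^{\mu} H^{\mu}
  && \text{(left-multiply by } \bar{m}^{\mu}\text{)} \\
0 &= \bar{m}^{\mu} r^{\mu} - g^{\mu}\, \bar{m}^{\mu} y^{\mu}
  && \text{(since } \bar{m}^{\mu} m^{\mu} = \bar{m}^{\mu}\text{)}
\end{align*}
% Each component of the last line gives
%   g^{\mu} = (\bar{m}^{\mu} r^{\mu})_s / (\bar{m}^{\mu} y^{\mu})_s,
% which is Equation 2.9.
```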

2.3.3  Solution  Methods  for  SMDPs 

Almost all the standard solution methods for MDPs generalize easily to SMDPs. Revised
policy and value iteration algorithms are straightforward, using the SMDP Bellman
equations but with all other elements remaining the same. It can be shown that these
algorithms converge (Howard, 1971; Puterman, 1994).

8 Under the unichain assumption, $\bar{m}^{\mu}$ has equal rows. Therefore, the right hand side of Equation 2.9 is a
vector with elements all equal to $g^{\mu}$.

9 The limiting matrix $\bar{m}^{\mu}$ satisfies the equality $m^{\mu} \bar{m}^{\mu} = \bar{m}^{\mu}$.

Online algorithms such as SARSA and Q-learning also generalize to the SMDP case
(Bradtke and Duff, 1995). Parr (1998) showed that the following version of Q-learning
converges in the SMDP case with several small differences in the conditions and
assumptions of the proof.

\[ Q(s, a) = (1 - \alpha) Q(s, a) + \alpha \left[ r(s, a) + \gamma^{N} \max_{a' \in \mathcal{A}} Q(s', a') \right] \]

This is the update formula when the agent takes action $a$ in state $s$, transitions to state $s'$,
the transition takes $N$ time steps, and the agent receives reward $r(s, a)$ on its way to state
$s'$.
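The update rule above can be sketched as a single tabular step. The dictionary-based Q-table, the function name, and the sample transition are assumptions made for this example; only the update formula itself comes from the text.

```python
from collections import defaultdict


def smdp_q_update(Q, s, a, reward, n_steps, s_next, actions, alpha, gamma):
    """One SMDP Q-learning update after action a in state s took n_steps time steps."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    target = reward + (gamma ** n_steps) * best_next
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]


# Made-up transition: action "go" in state "s0" took 3 time steps, earned a
# total reward of 1.0 along the way, and ended in state "s1".
Q = defaultdict(float)
Q[("s1", "left")] = 2.0
new_value = smdp_q_update(Q, "s0", "go", reward=1.0, n_steps=3, s_next="s1",
                          actions=["left", "right"], alpha=0.5, gamma=0.9)
```

The only difference from the one-step Q-learning update is that the successor value is discounted by $\gamma^{N}$ rather than $\gamma$.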


2.4  Hierarchy  and  Temporal  Abstraction 

Reasoning  and  learning  about  temporally  extended  actions  has  been  studied  extensively 
in  several  fields  including  classical  AI,  control  theory,  and  RL.  In  this  section,  we  look  at 
the  historical  development  of  hierarchy  and  temporal  abstraction  in  classical  AI,  control, 
and  RL. 

2.4.1  Temporal  Abstraction  in  Classical  AI 

The problem of using abstraction to facilitate planning has been a key focus of AI
research since its early days. The key idea was to replace the low-level actions available to
solve a given task by macro-operators, open-loop sequences of actions that can achieve
some subgoal. Macro-operators can provide an exponential reduction in the computational cost
of finding good plans.

Different forms of representation have been used for macro-operators, such as procedural
nets (Sacerdoti, 1974) and hierarchical task networks (Currie and Tate, 1991). All
these representations share two issues: the way in which the macro-operator
selects actions, and the model it uses to predict its consequences. However, the key issue
is learning useful macro-operators, which can be reused to solve different planning problems.
Korf (1985) introduced a method which decomposes a planning problem into a set of
independent and serializable subgoals, solves the subgoals individually, and then combines the
corresponding macro-operators to solve the larger planning problems. The SOAR system
(Laird et al., 1986) used a chunking mechanism, by which action sequences used to solve
subtasks were memorized as macro-operators. Knoblock (1990) addressed the learning of
macro-operators with the pre-conditions under which they succeed or fail. His work identifies
conditions under which a solution obtained in an abstracted state and action space can
indeed be executed. Drescher (1991) advocated a constructive approach in which knowledge
about the world is gradually acquired in the form of schemas, elementary models
containing a context (state), an action, and a result (new state). Schemas are built with the
purpose of capturing regularities in the environment, and subsequently are used to construct
new composite actions by sequencing existing primitives.

More recent research also takes into account the stochasticity of the environment
in which the plans have to be executed (Oates and Cohen, 1996; Brafman and Tennenholtz,
1997). Probabilistic and statistical methods such as belief and value functions, as well as
closed-loop behaviors, are used to deal with such environments.

2.4.2  Temporal  Abstraction  in  Control 

Modeling  and  control  of  multiple  time  scale  systems  is  an  active  research  area  in  control 
theory  where  temporally  extended  actions  and  models  have  been  extensively  used.  Multiple 
scale  systems  are  often  characterized  by  a  fast  motion  superimposed  over  a  slow  motion.  If 
the  two  motions  do  not  influence  each  other,  then  the  fast  motion  can  be  modeled  and  then 
eliminated  to  analyze  the  slow  motion. 

Perhaps the first application of temporal abstraction in stochastic control is the work by
Forestier and Varaiya (1978). They proposed using a two layer system where a supervisor at
the higher layer monitors the plant and intervenes only when the plant reaches a predefined
boundary condition, and a lower-level controller controls the plant between the boundary conditions.
The problem of choosing the optimal lower-level controller at each boundary state is a
decision problem operating at a slower time scale with only the boundary states as states
and only the lower-level controllers as actions.

The  problem  of  controlling  a  system  at  multiple  time  scales  has  also  been  addressed 
by  singular  perturbation  methods  (Kokotovic  et  al.,  1986;  Ho  and  Cao,  1991;  Cao  et  al., 
2002).  These  methods  assume  that  the  system  to  be  controlled  has  state  variables  with  fast 
and  slow  variations.  Each  type  of  variation  is  modeled  separately  which  leads  to  a  form  of 
hierarchical  control.  The  slow  variation  states  are  ignored  initially,  and  are  controlled  only 
after  the  fast  variation  states  have  been  accounted  for. 

2.4.3  Temporal  Abstraction  in  Reinforcement  Learning 

Temporally extended actions have been studied in hierarchical probabilistic planning
and hierarchical reinforcement learning (HRL). HRL is a general framework for scaling
RL to problems with large state spaces by using the task (or action) structure to restrict the
space of policies. The key principle underlying HRL is to develop learning algorithms that
do not need to learn policies from scratch, but instead reuse existing policies for simpler
subtasks (or macro-actions). Macros form the basis of hierarchical specifications of action
sequences because macros can include other macros in their definitions. This is similar to
the familiar idea of a subroutine from programming languages. A subroutine can call other
subroutines as well as execute primitive commands. Most of the existing HRL models have
roughly the same semantics as hierarchies of macros. However, a macro as an open-loop
control policy is inappropriate for most interesting control purposes, especially the control
of stochastic systems. HRL methods generalize the macro idea to closed-loop policies or,
more precisely, closed-loop partial policies, because they are generally defined for a subset
of the state space. The partial policies must also have well-defined termination conditions.
These partial policies with well-defined termination conditions are sometimes called
temporally extended actions. Work in HRL has followed three main trends: focusing on
subsets of the state space in a divide-and-conquer approach (state space decomposition),
grouping sequences or sets of actions together (temporal abstraction), and ignoring
differences between states based on the context (state abstraction). Much of the work falls into
several of these categories.
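The notion of a temporally extended action as a closed-loop partial policy with a termination condition can be illustrated with a small sketch. The class, its field names, the `run` helper, and the corridor environment are all hypothetical and chosen for this example; they do not correspond to any specific HRL formalism discussed in this section.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set

State = Hashable
Action = Hashable


@dataclass
class TemporallyExtendedAction:
    """Closed-loop partial policy with a termination condition (illustrative)."""
    states: Set[State]                   # subset of the state space where it is defined
    policy: Callable[[State], Action]    # closed loop: the action depends on the state
    terminates: Callable[[State], bool]  # termination condition


def run(option, state, step, max_steps=1000):
    """Execute the option from state; return (final_state, total_reward, n_steps)."""
    if state not in option.states:
        raise ValueError("option is not defined in this state")
    total, n_steps = 0.0, 0
    while n_steps < max_steps:
        state, reward = step(state, option.policy(state))
        total += reward
        n_steps += 1
        if option.terminates(state) or state not in option.states:
            break
    return state, total, n_steps


# Hypothetical corridor with cells 0..4: walk right until cell 4 is reached.
go_right = TemporallyExtendedAction(
    states={0, 1, 2, 3},
    policy=lambda s: "right",
    terminates=lambda s: s == 4,
)


def corridor_step(state, action):
    """Deterministic toy dynamics with a cost of 1 per primitive step."""
    return (state + 1 if action == "right" else state - 1), -1.0


result = run(go_right, 1, corridor_step)
```

From the point of view of a higher-level learner, one call to `run` is a single decision whose accumulated reward and duration play the roles of $r(s, a)$ and $N$ in the SMDP model of Section 2.3.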

Singh  (1992)  introduced  hierarchies  of  abstract  actions,  which  achieve  different  tasks, 
as  well  as  a  hierarchy  of  models  with  variable  temporal  resolution.  Singh  used  a  special 
purpose  gating  architecture  to  switch  between  abstract  actions,  and  specialized  learning 
algorithms  for  this  architecture.  Kaelbling  (1993a,b)  proposed  the  idea  of  using  subgoals 
both  in  order  to  learn  sub-policies  and  to  collapse  the  state  space.  Dayan  and  Hinton  (1993) 
presented  Feudal  RL,  a  hierarchical  technique  which  uses  both  temporal  abstraction  and 
state  abstraction.  It  recursively  partitions  the  state  space  and  the  time  scale  from  one  level 
to  the  next. 

The difficulty with using the above methods was that decisions in HRL are no longer
made at synchronous time steps, as is traditionally assumed in RL. Instead, the agent makes
decisions in epochs of variable length, such as when a distinguishing state is reached (e.g., an
intersection in a robot navigation task), or a subtask is completed (e.g., the elevator arrives
on the first floor). Fortunately, a well-known statistical model is available to treat variable
length actions: the SMDP model described in Section 2.3. Here, the state transition dynamics
are specified not only by the state where an action was taken, but also by parameters
specifying the length of time since the action was taken. Early work in RL on the SMDP
model studied extensions of algorithms such as Q-learning to continuous-time (Bradtke and
Duff, 1995; Mahadevan et al., 1997b). The early work on the SMDP model was then expanded
to include hierarchical task models over fully or partially specified lower level subtasks,
which led to developing powerful HRL models such as hierarchies of abstract machines
(HAMs) (Parr, 1998), options (Sutton et al., 1999; Precup, 2000), MAXQ (Dietterich,
2000), and programmable HAMs (PHAMs) (Andre and Russell, 2001; Andre, 2003). In
the options model (at least in its simplest form), Sutton et al. studied how to learn
policies given fully specified policies for executing subtasks. In the HAMs formulation, Parr
showed how hierarchical learning could be achieved even when the policies for lower-level
subtasks  were  only  partially  specified.  The  MAXQ  model  is  one  of  the  first  methods  to 
combine  temporal  abstraction  with  state  abstraction.  It  provides  a  more  comprehensive 
framework  for  hierarchical  learning  where  instead  of  policies  for  subtasks,  the  learner  is 
given  pseudo-reward  functions.  Unlike  options  and  HAMs,  MAXQ  does  not  rely  directly 
on  reducing  the  entire  problem  to  a  single  SMDP.  Instead,  a  hierarchy  of  SMDPs  is  created 
whose solutions can be learned simultaneously. The key feature of MAXQ is the decomposed
representation of the value function. Dietterich views each subtask as a separate
MDP, and thus represents the value of a state within that MDP as composed of the reward
for taking an action at that state (which might be composed of many rewards along a
trajectory through a subtask) and the expected reward for completing the subtask. To isolate
the subtask from the calling context, Dietterich uses the notion of a pseudo-reward. At the
terminal states of a subtask, the agent is rewarded according to the pseudo-reward, which
is set a priori by the designer, and does not depend on what happens after leaving the
current subtask. Each subtask can then be treated in isolation from the rest of the problem
with  the  caveat  that  the  solutions  learned  are  only  recursively  optimal.  Each  action  in  the 
recursively  optimal  policy  is  optimal  with  respect  to  the  subtask  containing  the  action,  all 
descendant  subtasks,  and  the  pseudo-reward  chosen  by  the  designer  of  the  system.  Another 
important  contribution  of  Dietterich’s  work  is  the  idea  that  state  abstraction  can  be  done 
separately  on  the  different  components  of  the  value  function,  which  allows  one  to  perform 
more abstraction. We investigate the MAXQ framework and its related concepts such as
pseudo-reward, recursive optimality, value function decomposition, and state abstraction in
more detail in Chapter 3. In the PHAMs model, Andre and Russell extended HAMs and
presented an agent design language for RL. Andre and Russell (2002) also addressed the
issue of safe state abstraction in HRL. Their method yields state abstraction while
maintaining hierarchical optimality.

HRL has also been successfully applied to behavior-based robotics (Brooks, 1986) in
several  applications  (Mahadevan  and  Connell,  1992;  Lin,  1993;  Digney,  1996;  Mataric, 
1997;  Huber  and  Grupen,  1997).  Mahadevan  and  Connell  used  a  subsumption  architecture 
in  which  simple  behaviors  are  acquired  using  RL  and  then  are  combined  by  a  pre-defined 
scheme to solve a complex robot box-pushing task. Lin used the decomposition of a complex
task into smaller subtasks, each having its own limited state space and its own reward
function. A robot can learn a behavior for solving each subtask, and then use RL at the
higher level in order to determine the best combination of sub-behaviors. Huber and Grupen used RL
and a hybrid discrete event dynamical system to learn walking gaits for a robot. At the low
level, the robot uses a set of pre-existing controllers that can generate collision-free motion
and optimize forces and posture. At the higher level, RL is used to determine which
controller should be applied, depending on a set of discrete variables describing the state of the
system. 

Recent research is also targeted toward finding temporally extended actions automatically.
Thrun and Schwartz (1995) and Pickett and Barto (2002) generate temporal abstractions
by finding commonly occurring sub-policies in solutions to a set of tasks. Digney
(1996),  McGovern  and  Barto  (2001),  Menache  et  al.  (2002),  and  Simsek  and  Barto  (2004) 
identify  subgoal  states  and  generate  temporally  extended  actions  that  take  the  agent  to  these 
states.  Digney’s  subgoals  are  states  that  are  visited  frequently  or  that  have  a  high  reward 
gradient.  McGovern  and  Barto’s  method  identifies  as  subgoals  those  regions  of  the  state 
space  that  the  agent  visits  frequently  on  successful  trajectories  but  not  on  unsuccessful 
ones.  Menache  et  al.  define  subgoals  as  the  border  states  of  strongly  connected  areas 
of  the  MDP  transition  graph  and  find  them  using  a  max-flow/min-cut  algorithm.  Simsek 
and  Barto  propose  a  method  to  identify  useful  temporal  abstractions  using  relative  novelty. 
Their  definition  of  novelty  relates  it  to  how  frequently  a  state  is  visited  since  a  designated 
start  time.  They  define  relative  novelty  of  a  state  in  a  transition  sequence  as  the  ratio  of 
the novelty of states that followed it (including itself) to the novelty of the states that
preceded it. Hengst (2002) and Jonsson and Barto (2005) proposed constructing a hierarchy
of abstractions in problems with factored state spaces. Hengst’s method orders state
variables with respect to their frequency of change and adds a layer of hierarchy for each state
variable,  where  each  layer  handles  a  smaller  MDP  than  its  lower  layers.  Jonsson  and  Barto 
determine  causal  relationships  between  state  variables  using  a  dynamic  Bayesian  network 
(DBN)  model  of  factored  MDPs  and  like  Hengst’s  algorithm,  their  algorithm  introduces 
layers  of  temporally  extended  actions  based  on  the  causal  structure  of  the  task.  Mannor 
et  al.  (2004)  find  clusters  of  states  and  define  temporally  extended  actions  as  a  sub-policy 
that allows the agent to efficiently shift from one cluster to the other. They use two different
clustering mechanisms, one that employs only topology, and one that uses the reward
structure  of  the  problem  in  addition  to  topology. 

2.5  Multi-Agent  Reinforcement  Learning 

The analysis of multi-agent systems is a topic of interest in both economic theory and
AI. Their integration with existing methods in AI constitutes a promising area of research.
An optimal policy in a multi-agent system may depend on the behavior of other agents,
which is often not predictable. This makes learning and adaptation a necessary component of
an agent. Multi-agent learning studies algorithms for selecting actions for multiple agents
coexisting in the same environment. This is a complicated problem, because the behaviors
of the other agents can be changing as they also adapt to achieve their own goals. This
usually makes the environment non-stationary and often non-Markovian as well (Mataric,
1997). Examples of challenging multi-agent domains that need robust learning algorithms
for coordination among multiple agents, or for effectively responding to other agents, include
Robosoccer; disaster rescue, where robots must safely find victims as fast as possible
after an earthquake; e-commerce; manufacturing systems, where managers of a factory
coordinate to maximize the profit; and distributed sensor networks, where multiple sensors
collaborate to perform a large-scale sensing task under strict power constraints (Weiss, 1999;
Lesser et al., 2003).

In addition to the existing methods in distributed AI and machine learning, game theory
also provides a framework for research in multi-agent learning. The game theoretic concepts
of stochastic games and Nash equilibria (Owen, 1995; Filar and Vrieze, 1997) are
the foundation for much of the recent research in multi-agent learning. Learning algorithms
use stochastic games as a natural extension of MDPs to multiple agents. These algorithms
can be summarized by broadly grouping them into two categories: equilibria learners and
best-response learners. Equilibria learners such as Minimax-Q (Littman, 1994), Nash-Q
(Hu and Wellman, 1998), the gradient ascent learner in (Singh et al., 2000b), and
Friend-or-Foe-Q (Littman, 2001) seek to learn an equilibrium of the game by iteratively computing
intermediate equilibria. They guarantee convergence to their part of an equilibrium
solution regardless of the behavior of the other agents. On the other hand, best-response
learners seek to learn the best response to the other agents. Although not an explicitly
multi-agent algorithm, Q-learning (Watkins, 1989) was one of the first algorithms applied
to multi-agent problems (Tan, 1993; Crites and Barto, 1998). Joint-state/joint-action
learners (Boutilier, 1999) and WoLF-PHC (Bowling and Veloso, 2002) are other examples of
best-response learners. Bowling and Veloso (2002) showed that if best-response
learners playing with each other converge, they must converge to a
Nash equilibrium.

Multi-agent learning has been recognized to be challenging for two main reasons: 1)
curse of dimensionality: the number of parameters to be learned increases dramatically
with the number of agents, and 2) partial observability: the states and actions of the other
agents, which an agent requires to make its decisions, are not fully observable, and
inter-agent communication is usually costly.

Prior work in multi-agent learning has addressed the curse of dimensionality in many
different ways. One natural approach is to restrict the amount of information that is available
to each agent and hope to maximize the global payoff by solving local optimization
problems  for  each  agent.  This  idea  has  been  addressed  using  value  function  RL  (Schneider 
et  al.,  1999)  as  well  as  policy  gradient  RL  (Peshkin  et  al.,  2000).  Another  approach  is  to 
exploit  the  structure  in  a  multi-agent  problem  using  factored  value  functions.  Guestrin  et  al. 
(2002)  integrate  these  ideas  in  collaborative  multi-agent  domains.  They  use  value  function 
approximation  and  approximate  the  joint  value  function  as  a  linear  combination  of  local 
value  functions,  each  of  which  relates  only  to  the  parts  of  the  system  controlled  by  a  small 
number  of  agents.  Factored  value  functions  allow  the  agents  to  find  a  globally  optimal 
joint-action  using  a  message  passing  scheme.  However,  this  approach  does  not  address  the 
communication  cost  in  its  message  passing  strategy. 

Graphical models have also been used to address the curse of dimensionality in multi-agent
systems. This work seeks to transfer the representational and computational benefits
that graphical models provide to probabilistic inference in multi-agent systems and game
theory (La-Mura, 2000; Koller and Milch, 2001). The previous work established algorithms
for computing Nash equilibria in one-stage games, including efficient algorithms for computing
approximate (Kearns et al., 2001) and exact (Littman et al., 2002) Nash equilibria in
tree-structured games, and convergent heuristics for computing Nash equilibria in general
graphs (Vickrey and Koller, 2002; Ortiz and Kearns, 2003).

The curse of dimensionality has also been addressed in multi-agent robotics. Multi-robot
learning methods usually reduce the complexity of the problem by not modeling
joint  states  or  actions  explicitly,  such  as  work  by  Mataric  (1997)  and  Balch  and  Arkin 
(1998),  among  others.  In  such  systems,  each  robot  maintains  its  position  in  a  formation 
depending  on  the  locations  of  the  other  robots,  so  there  is  some  implicit  communication  or 
sensing  of  states  and  actions  of  the  other  agents.  There  has  also  been  work  on  reducing  the 
parameters  needed  for  Q-learning  in  multi-agent  domains  by  learning  action-values  over  a 
set  of  derived  features  (Stone  and  Veloso,  1999).  These  derived  features  are  domain  specific 
and have to be encoded by hand, or constructed by a supervised learning algorithm.

Almost all the above methods ignore the problem that an agent might not have free
access to the other agents’ information that is required to make its own decision. In general,
the world is partially observable for each agent in a distributed multi-agent setting.
POMDPs have been used to model partial observability in probabilistic AI. The POMDP
framework can be extended to allow for multiple distributed agents to base their decisions
on their local observations. This model is called decentralized POMDP (DEC-POMDP)
and it has been shown that the decision problem for a DEC-POMDP is NEXP-complete
(Bernstein et al., 2000). One way to address partial observability in distributed multi-agent
domains is to use communication to exchange required information. However, since
communication can be costly, in addition to its normal actions, each agent needs to decide
about communication with other agents (Xuan et al., 2001; Xuan and Lesser, 2002).
Pynadath and Tambe (2002) extended DEC-POMDP by including communication decisions
in the model, and proposed a framework called communicative multi-agent team decision
problem (COM-MTDP). Since DEC-POMDP can be reduced to COM-MTDP with
no communication by copying all the other model features, the decision problem for a
COM-MTDP is also NEXP-complete (Pynadath and Tambe, 2002). The trade-off between the
quality of solution, the cost of communication, and the complexity of the model is
currently a very active area of research in multi-agent learning and planning.

CHAPTER 3


A  FRAMEWORK  FOR  HIERARCHICAL  REINFORCEMENT 

LEARNING 


In this chapter, we introduce a general hierarchical reinforcement learning (HRL) framework
for simultaneous learning of policies at multiple levels of hierarchy. Our treatment
builds upon the existing approaches such as HAMs (Parr, 1998), options (Sutton et al.,
1999; Precup, 2000), MAXQ (Dietterich, 2000), and PHAMs (Andre and Russell, 2002;
Andre, 2003), especially the MAXQ value function decomposition. In our framework,
we add three-part value function decomposition (Andre and Russell, 2002) to guarantee
hierarchical optimality, and reward shaping (Ng et al., 1999) to reduce the burden of
exploration, to the MAXQ method. Rather than redundantly explain MAXQ and then our
hierarchical  framework,  we  will  present  our  model  and  note  throughout  this  chapter  where 
the  key  pieces  were  inspired  by  or  are  directly  related  to  Dietterich’s  MAXQ  work.  In  the 
following  chapters,  we  first  extend  this  framework  to  the  average  reward  model,  then  we 
generalize  it  to  be  applicable  to  problems  with  continuous  state  and/or  action  spaces,  and 
finally  broaden  it  to  be  appropriate  for  domains  with  multiple  cooperative  agents. 

3.1  Motivating  Example 

In  the  HRL  framework,  the  designer  of  the  system  imposes  a  hierarchy  on  the  problem 
to  incorporate  domain  knowledge  and  thereby  reduces  the  size  of  the  space  that  must  be 
searched  to  find  a  good  policy.  The  designer  recursively  decomposes  the  overall  task  into  a 
collection  of  subtasks  that  she/he  believes  are  important  for  solving  the  problem. 

Let us illustrate the main ideas using the simple search task shown in Figure 3.1. Consider
the case where, in an office-type environment (rooms and connecting corridors), a robot is
assigned the task of picking up trash from trash cans (X1 and X2) over an extended area
and accumulating it into one centralized trash bin (Dump), from where it might be sent for
recycling or disposed of. For simplicity, we assume that the robot can observe its true
location in the environment. The main subtasks in this problem are root (the whole trash
collection task), collect trash at X1 and X2, and navigate to X1, X2, and Dump. Each of
these subtasks is defined by a set of termination states. After defining subtasks, we must
indicate, for each subtask, which other subtasks or primitive actions it should employ to
reach its goal. For example, navigate to X1, X2, and Dump use three primitive actions: find
wall, align with wall, and follow wall. Collect trash at X1 uses two subtasks, navigate to
X1 and Dump, plus two primitive actions, Put and Pick, and so on. As in MAXQ, all of this
information can be summarized by a directed acyclic graph called the task graph. The task
graph for the trash collection problem is shown in Figure 3.1. This hierarchical model
supports both state abstraction and subtask sharing. State abstraction: while the agent is
moving toward the Dump, the status of trash cans X1 and X2 is irrelevant and cannot affect
the navigation process, so the variables defining the status of X1 and X2 can be removed
from the state space of the navigate to Dump subtask. Subtask sharing: if the system learns
how to solve the navigate to Dump subtask once, then the solution can be shared by both the
collect trash at X1 and the collect trash at X2 subtasks.
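
The task graph described above can be written down as a small adjacency structure. The following Python sketch is illustrative only: the identifier names are hypothetical, and the assumption that root invokes the two collect-trash subtasks directly is ours, since the text does not list the children of root explicitly.

```python
# Hypothetical encoding of the trash collection task graph (names are illustrative).
# Keys are non-primitive subtasks; anything that is not a key is a primitive action.
TASK_GRAPH = {
    "root": ["collect_trash_at_X1", "collect_trash_at_X2"],  # assumed children of root
    "collect_trash_at_X1": ["navigate_to_X1", "navigate_to_Dump", "pick", "put"],
    "collect_trash_at_X2": ["navigate_to_X2", "navigate_to_Dump", "pick", "put"],
    "navigate_to_X1": ["find_wall", "align_with_wall", "follow_wall"],
    "navigate_to_X2": ["find_wall", "align_with_wall", "follow_wall"],
    "navigate_to_Dump": ["find_wall", "align_with_wall", "follow_wall"],
}


def is_primitive(task):
    """A task is primitive when it has no entry (no children) in the graph."""
    return task not in TASK_GRAPH


def parents(task):
    """Return the sorted list of subtasks that may invoke `task`."""
    return sorted(p for p, children in TASK_GRAPH.items() if task in children)


def is_acyclic(graph):
    """Depth-first check that the task graph is a directed acyclic graph."""
    in_progress, done = set(), set()

    def visit(node):
        if node in done:
            return True
        if node in in_progress:  # back edge: a subtask would invoke itself
            return False
        in_progress.add(node)
        ok = all(visit(child) for child in graph.get(node, []))
        in_progress.discard(node)
        done.add(node)
        return ok

    return all(visit(node) for node in graph)


# Non-primitive subtasks with more than one parent are candidates for subtask sharing.
shared = [task for task in TASK_GRAPH if len(parents(task)) > 1]
```

Under these assumptions, the only shared non-primitive subtask is the navigation to the Dump, which matches the subtask-sharing argument in the text.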

Like HAMs (Parr, 1998), options (Sutton et al., 1999; Precup, 2000), MAXQ (Dietterich,
2000), and PHAMs (Andre and Russell, 2001; Andre, 2003), this framework also
relies on the theory of SMDPs. While SMDP theory provides the theoretical underpinnings
of temporal abstraction by modeling actions that take varying amounts of time, the SMDP
model provides little in the way of concrete representational guidance, which is critical
from a computational point of view. In particular, the SMDP model does not specify how
tasks can be broken up into subtasks, how to decompose value functions, etc. We examine
these issues next.

Figure 3.1. A robot trash collection task and its associated task graph.


As in MAXQ, a task hierarchy such as the one illustrated above can be modeled by
decomposing the overall task MDP $\mathcal{M}$ into a finite set of subtasks
$\{M_0, M_1, \ldots, M_{m-1}\}$, where $M_0$ is the root task. Solving $M_0$ solves the
entire MDP $\mathcal{M}$.

Definition 3.1: Each non-primitive subtask $M_i$ ($M_i$ is not a primitive action) consists
of five components $(S_i, I_i, T_i, A_i, R_i)$:

• $S_i$ is the state space for subtask $M_i$. It is described by those state variables that are
relevant to subtask $M_i$. The range of a state variable describing $S_i$ might be a subset
of its range in $S$ (the state space of MDP $\mathcal{M}$).

• $I_i \subseteq S_i$ is the initiation set for subtask $M_i$. Subtask $M_i$ can be initiated only
in states belonging to $I_i$.

• $T_i \subseteq S_i$ is the set of terminal states for subtask $M_i$. Subtask $M_i$ terminates when
it reaches a state in $T_i$. A policy for subtask $M_i$ can only be executed if the current
state $s$ belongs to $(S_i - T_i)$.

• $A_i$ is the set of actions that can be performed to achieve subtask $M_i$. These actions
can be either primitive actions from $A$ (the set of primitive actions for MDP $\mathcal{M}$),
or they can be other subtasks. Technically, $A_i$ is a function of states, since it may
differ from one state to another. However, we will suppress this dependence in our
notation.

• $R_i$ is the reward structure inside subtask $M_i$ and could be different from the reward
function of MDP $\mathcal{M}$. Here we use the idea of reward shaping (Ng et al., 1999)
and define a more general reward structure than MAXQ's, which specifies a pseudo-reward
only for transitions to terminal states. Reward shaping is a method for guiding
an agent toward a solution without constraining the search space. Besides the reward
of the overall task MDP $\mathcal{M}$, each subtask $M_i$ can use additional rewards to guide
its local learning. Additional rewards are only used inside each subtask and do not
propagate to upper levels in the hierarchy. If the reward structure inside a subtask is
different from the reward function of the overall task, we need to define two types of
value functions for each subtask: an internal value function and an external value function.
The internal value function is defined based on both the local reward structure of
the subtask and the reward of the overall task, and is only used in learning the subtask.
The external value function, on the other hand, is defined based only on the reward
function of the overall task and is propagated to the higher levels in the hierarchy
to be used in learning the global policy. This reward structure includes MAXQ's
pseudo-reward as a special case.1 □

Each  primitive  action  a  is  a  primitive  subtask  in  this  decomposition,  such  that  a  is 
always  executable  and  it  terminates  immediately  after  execution.  From  now  on  in  this 
thesis,  we  use  subtask  to  refer  to  non-primitive  subtasks. 
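
The five components of Definition 3.1 map naturally onto a small record type. The sketch below is one possible encoding, not the representation used in this thesis: the class name, field names, string-valued states, and the choice to combine the local and overall rewards by simple addition are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class Subtask:
    """Hypothetical container for the five components (S_i, I_i, T_i, A_i, R_i)."""

    name: str
    states: FrozenSet[str]      # S_i
    initiation: FrozenSet[str]  # I_i, a subset of S_i
    terminal: FrozenSet[str]    # T_i, a subset of S_i
    actions: FrozenSet[str]     # A_i: primitive actions or names of other subtasks
    # Local (shaping) reward, used only inside this subtask: (s, a, s') -> float.
    local_reward: Callable[[str, str, str], float]

    def __post_init__(self):
        if not self.initiation <= self.states:
            raise ValueError("initiation set must be a subset of the state space")
        if not self.terminal <= self.states:
            raise ValueError("terminal set must be a subset of the state space")

    def can_start(self, s: str) -> bool:
        """The subtask may be initiated only in states of I_i."""
        return s in self.initiation

    def can_execute(self, s: str) -> bool:
        """The subtask policy may run only in states of S_i - T_i."""
        return s in self.states and s not in self.terminal

    def internal_reward(self, task_reward: float, s: str, a: str, s_next: str) -> float:
        """Reward for the internal value function (assumed additive combination)."""
        return task_reward + self.local_reward(s, a, s_next)

    @staticmethod
    def external_reward(task_reward: float) -> float:
        """Reward for the external value function: the overall task reward only."""
        return task_reward


navigate = Subtask(
    name="navigate_to_Dump",
    states=frozenset({"corridor", "room", "dump"}),
    initiation=frozenset({"corridor", "room"}),
    terminal=frozenset({"dump"}),
    actions=frozenset({"find_wall", "align_with_wall", "follow_wall"}),
    local_reward=lambda s, a, s_next: 1.0 if s_next == "dump" else 0.0,
)
```

Keeping `internal_reward` and `external_reward` separate mirrors the distinction in the definition: the shaping term influences learning inside the subtask but is never passed up to the parent.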

1 The MAXQ pseudo-reward function is defined only for transitions to terminal states, and is
zero for non-terminal states.

3.2 Policy Execution

If we have a policy for each subtask in the hierarchy, we can define a hierarchical policy
for the model.

Definition 3.2: A hierarchical policy $\mu$ is a set of policies, one policy for each of the
subtasks in the hierarchy: $\mu = \{\mu_0, \ldots, \mu_{m-1}\}$. □

The hierarchical policy is executed using a stack discipline, similar to subroutine calls in
ordinary programming languages. Each subtask policy takes a state and returns the name of a
primitive action to execute or the name of a subtask to invoke. When a subtask is invoked,
its name is pushed onto the Task-Stack and its policy is executed until it enters one of its
terminal states. When a subtask terminates, its name is popped off the Task-Stack. If any
subtask on the Task-Stack terminates, then all subtasks below it in the hierarchy (the
subtasks it has invoked, which sit above it on the Task-Stack) are immediately aborted, and
control returns to the subtask that had invoked the terminated subtask. Hence, at any time,
the root task is located at the bottom and the subtask which is currently being executed is
located at the top of the Task-Stack.
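
The stack discipline can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the execution engine used in this thesis: the function `run_hierarchy`, the one-dimensional corridor world, and the subtask names `go_to_2` and `go_to_4` are hypothetical, and the root is allowed to terminate so that the loop ends.

```python
def run_hierarchy(policies, terminal, step, state, max_iters=100):
    """Execute a hierarchical policy with a Task-Stack (root at the bottom).

    policies[task](state) returns a child name; names without an entry in
    `policies` are primitive actions. terminal[task](state) tests termination.
    step(state, action) returns the next state for a primitive action.
    """
    stack = ["root"]
    trace = []  # (stack contents, primitive action) pairs, for inspection
    for _ in range(max_iters):
        # Scan from the bottom (root) upward: the first terminated subtask is
        # popped together with everything above it, i.e. the subtasks it invoked.
        for depth, task in enumerate(stack):
            if terminal[task](state):
                del stack[depth:]
                break
        if not stack:  # the root itself has terminated
            break
        choice = policies[stack[-1]](state)
        if choice in policies:  # non-primitive: invoke by pushing its name
            stack.append(choice)
        else:  # primitive: execute for one time step
            trace.append((tuple(stack), choice))
            state = step(state, choice)
    return state, trace


# Toy corridor with positions 0..4; the only primitive action moves one cell right.
policies = {
    "root": lambda s: "go_to_2" if s < 2 else "go_to_4",
    "go_to_2": lambda s: "right",
    "go_to_4": lambda s: "right",
}
terminal = {
    "root": lambda s: s == 4,
    "go_to_2": lambda s: s == 2,
    "go_to_4": lambda s: s == 4,
}
final_state, trace = run_hierarchy(policies, terminal, lambda s, a: s + 1, state=0)
```

In this run the first two primitive steps are taken with `go_to_2` on top of the stack and the last two with `go_to_4` on top, while `root` stays at the bottom throughout.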

Under a hierarchical policy $\mu$, we define a multi-step transition probability
$P_i^\mu : S_i \times \mathbb{N} \times S_i \to [0, 1]$ for each subtask $M_i$ in the hierarchy,
where $P_i^\mu(s', N | s)$ denotes the probability that hierarchical policy $\mu$ will cause the
system to transition from state $s$ to state $s'$ in $N$ time steps at subtask $M_i$. We also
define a multi-step abstract transition probability
$F_i^\mu : S_i \times \mathbb{N} \times S_i \to [0, 1]$ for each subtask $M_i$ under the
hierarchical policy $\mu$. The term $F_i^\mu(s', N | s)$ denotes the $N$-step abstract transition
probability from state $s$ to state $s'$ under hierarchical policy $\mu$ at subtask $M_i$, where
$N$ is the number of actions taken by subtask $M_i$, not the number of primitive actions taken
in this transition. In this thesis, we use the multi-step abstract transition probability
$F_i^\mu$ to model state transitions at the subtask level, and the multi-step transition
probability $P_i^\mu$ to model state transitions at the level of primitive actions. Finally, we
define a single-step transition probability $P^\mu : S \times S \to [0, 1]$ under the
hierarchical policy $\mu$, where $P^\mu(s' | s)$ denotes the probability that the hierarchical
policy $\mu$ will cause the system to transition from state $s$ to state $s'$ at the level of
primitive actions.

3.3  Local  versus  Global  Optimality 

Using a hierarchy reduces the size of the space that must be searched to find a good policy.
However, a hierarchy also constrains the space of possible policies, so it may not be
possible to represent the optimal policy or its value function, and hence it may be
impossible to learn the optimal policy. If we cannot learn the optimal policy, the next best
target is to learn the best policy that is consistent with the given hierarchy. Two notions of
optimality have been explored in previous work on hierarchical reinforcement learning:
hierarchical optimality and recursive optimality (Dietterich, 2000).

Definition 3.3: A hierarchically optimal policy for MDP $\mathcal{M}$ is a hierarchical policy
which has the best performance among all policies consistent with the given hierarchy. In
other words, hierarchical optimality is a global optimum consistent with the given hierarchy.
In this form of optimality, the policy for each individual subtask is not necessarily optimal,
but the policy for the entire hierarchy is optimal. The HAMQ HRL algorithm (Parr, 1998)
and the SMDP Q-learning algorithm for a fixed set of options (Sutton et al., 1999; Precup,
2000) both converge to a hierarchically optimal policy. □

Definition 3.4: Recursive optimality, first introduced by Dietterich (2000), is a weaker
but more flexible form of optimality which only guarantees that the policy of each subtask
is optimal given the policies of its children. It is an important and flexible form of
optimality because it permits each subtask to learn a locally optimal policy while ignoring
the behavior of its ancestors in the hierarchy. This increases the opportunity for subtask
sharing and state abstraction. The MAXQ-Q HRL algorithm (Dietterich, 2000) converges
to a recursively optimal policy. □

3.4 Value Function Definitions

For recursive optimality, the goal is to find a hierarchical policy
$\mu = \{\mu_0, \ldots, \mu_{m-1}\}$ such that for each subtask $M_i$ in the hierarchy, the
expected cumulative reward of executing policy $\mu_i$ and the policies of all descendants of
$M_i$ is maximized. In this case, the value function to be learned for subtask $M_i$ under
hierarchical policy $\mu$ must contain only the reward received during the execution of
subtask $M_i$. We call this the projected value function after Dietterich (2000), and define
it as follows:

Definition 3.5: The projected value function of a hierarchical policy $\mu$ on subtask $M_i$,
denoted $\hat{V}^\mu(i, s)$, is the expected cumulative reward of executing policy $\mu_i$ and the
policies of all descendants of $M_i$ starting in state $s \in S_i$ until $M_i$ terminates. □

The expected cumulative reward outside a subtask is not a part of its projected value
function. This makes the projected value function of a subtask dependent only on the subtask
and its descendants.

On the other hand, for hierarchical optimality, the goal is to find a hierarchical policy
that maximizes the expected cumulative reward. In this case, the value function to be
learned for subtask $M_i$ under hierarchical policy $\mu$ must contain the reward received during
the execution of subtask $M_i$, and the reward after subtask $M_i$ terminates. We call this the
hierarchical value function following Dietterich (2000). The hierarchical value function
of a subtask includes the expected reward outside the subtask and therefore depends on
the subtask and all its ancestors up to the root of the hierarchy. In the case of hierarchical
optimality, we need to consider the contents of the Task-Stack as an additional part of the
state space of the problem, since a subtask might be shared by multiple parents.




Definition 3.6: $\Omega$ is the space of possible values of the Task-Stack for hierarchy
$\mathcal{H}$. □

Let us define the joint state space $\mathcal{X} = \Omega \times S$ for the hierarchy
$\mathcal{H}$ as the cross product of the set of Task-Stack values $\Omega$ and the state space
$S$. We define the hierarchical value function using the joint state space $\mathcal{X}$ as
follows:

Definition 3.7: A hierarchical value function for subtask $M_i$ in state $x = (\omega, s)$ under
hierarchical policy $\mu$, denoted $V^\mu(i, x)$, is the expected cumulative reward of following
the hierarchical policy $\mu$ starting in state $s \in S_i$ and Task-Stack $\omega$. □

The current subtask $M_i$ is a part of the Task-Stack $\omega$ and as a result is a part of the
state $x$. So we can exclude it from the hierarchical value function notation and write
$V^\mu(i, x)$ as $V^\mu(x)$. However, for clarity, we use $V^\mu(i, x)$ in the rest of this
dissertation.

Theorem 3.1: Under a hierarchical policy $\mu$, each subtask $M_i$ can be modeled by an
SMDP consisting of components $(S_i, A_i, P_i^\mu, \bar{R}_i)$, where
$\bar{R}_i(s, a) = \hat{V}^\mu(a, s)$ for all $a \in A_i$. □

This theorem is similar to Theorem 1 in Dietterich (2000). Using this theorem, we can
define a recursively optimal policy for MDP $\mathcal{M}$ with hierarchical decomposition
$\{M_0, M_1, \ldots, M_{m-1}\}$ as a hierarchical policy $\mu = \{\mu_0, \ldots, \mu_{m-1}\}$ such
that for each subtask $M_i$, the corresponding policy $\mu_i$ is optimal for the SMDP defined by
the tuple $(S_i, A_i, P_i^\mu, \bar{R}_i)$.

3.5  Value  Function  Decomposition 

A value function decomposition splits the value of a state or a state-action pair into
multiple additive components. Modularity in the hierarchical structure of a task allows us
to carry out this decomposition along subtask boundaries. In this section, we first describe
the two-part or MAXQ decomposition proposed by Dietterich (2000), and then the three-part
decomposition proposed by Andre and Russell (2002). We use both decompositions in
our hierarchical framework depending on the type of optimality (hierarchical or recursive)
that we are interested in.

The two-part value function decomposition is at the center of the MAXQ method. The
purpose of this decomposition is to decompose the projected value function of the root task,
$\hat{V}^\mu(0, s)$, in terms of the projected value functions of all of the subtasks in the
hierarchy. The projected value of subtask $M_i$ at state $s$ under hierarchical policy $\mu$ can
be written as

$$\hat{V}^\mu(i, s) = E\left[\sum_{k=0}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, s_0 = s, \mu\right] \qquad (3.1)$$


Now let us suppose that the first action chosen by $\mu_i$ is invoked and it executes for a
number of steps $N$ and terminates in state $s'$ according to $P_i^\mu(s', N | s, \mu_i(s))$. We
can re-write Equation 3.1 as

$$\hat{V}^\mu(i, s) = E\left[\sum_{k=0}^{N-1} \gamma^k r(s_k, a_k) + \sum_{k=N}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, s_0 = s, \mu\right] \qquad (3.2)$$


The first summation on the right-hand side of Equation 3.2 is the discounted sum of rewards
for executing subtask $\mu_i(s)$ starting in state $s$ until it terminates; in other words, it
is $\hat{V}^\mu(\mu_i(s), s)$, the projected value function of the child task $\mu_i(s)$. The second
term on the right-hand side of the equation is the projected value of state $s'$ for the
current task $M_i$, $\hat{V}^\mu(i, s')$, discounted by $\gamma^N$, where $s'$ is the current state
when subroutine $\mu_i(s)$ terminates and $N$ is the number of transition steps from state $s$
to state $s'$. We can therefore write Equation 3.2 in the form of a Bellman equation:

$$\hat{V}^\mu(i, s) = \hat{V}^\mu(\mu_i(s), s) + \sum_{s', N} P_i^\mu(s', N | s, \mu_i(s)) \, \gamma^N \, \hat{V}^\mu(i, s') \qquad (3.3)$$
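
The step from the second summation of Equation 3.2 to the second term of Equation 3.3 can be spelled out as follows. This is a sketch of the standard argument, under the assumption that rewards collected after the child subtask terminates depend on the past only through the termination state $s'$; the re-indexing variable $j = k - N$ is introduced here for illustration.

$$
\begin{aligned}
E\left[\sum_{k=N}^{\infty} \gamma^k r(s_k, a_k) \,\middle|\, s_0 = s, \mu\right]
&= \sum_{s', N} P_i^\mu(s', N | s, \mu_i(s)) \, \gamma^N \,
   E\left[\sum_{j=0}^{\infty} \gamma^j r(s_{N+j}, a_{N+j}) \,\middle|\, s_N = s', \mu\right] \\
&= \sum_{s', N} P_i^\mu(s', N | s, \mu_i(s)) \, \gamma^N \, \hat{V}^\mu(i, s')
\end{aligned}
$$

The first line conditions on the pair $(s', N)$ and factors $\gamma^N$ out of the sum; the second line recognizes the inner expectation as the definition of the projected value function in Equation 3.1, evaluated at $s'$.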




Equation 3.3 can be re-stated for the projected action-value function as follows:

$$\hat{Q}^\mu(i, s, a) = \hat{V}^\mu(a, s) + \sum_{s', N} P_i^\mu(s', N | s, a) \, \gamma^N \, \hat{Q}^\mu(i, s', \mu_i(s')) \qquad (3.4)$$

The right-most term in this equation is the expected discounted cumulative reward of
completing subtask $M_i$ after executing subtask $M_a$ in state $s$. Dietterich called this
term the completion function; it is denoted by $C^\mu(i, s, a)$. With this definition, we can
express the projected action-value function recursively as

$$\hat{Q}^\mu(i, s, a) = \hat{V}^\mu(a, s) + C^\mu(i, s, a) \qquad (3.5)$$


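and we can re-express the definition for the projected value function as

$$\hat{V}^\mu(i, s) = \begin{cases} \hat{Q}^\mu(i, s, \mu_i(s)) & \text{if } M_i \text{ is a non-primitive subtask,} \\ \sum_{s'} P(s' | s, i) \, r(s, i) & \text{if } M_i \text{ is a primitive action.} \end{cases} \qquad (3.6)$$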


Equations 3.5 and 3.6 are referred to as the two-part decomposition equations for a hierarchy
under a fixed hierarchical policy $\mu$. These equations recursively decompose the projected
value function for the root into the projected value functions for the individual subtasks,
$M_1, \ldots, M_{m-1}$, and the individual completion functions $C^\mu(j, s, a)$ for
$j = 1, \ldots, m-1$.

The fundamental quantities that must be stored to represent the value function decomposition
are the $C$ values for all non-primitive subtasks and the $\hat{V}$ values for all primitive
actions.2 The two-part decomposition is summarized graphically in Figure 3.2. As mentioned
in Section 3.4, since the expected reward after execution of subtask $M_i$ is not a
component of the projected action-value function, the two-part decomposition allows only
for recursive optimality.
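
The recursion in Equations 3.5 and 3.6 can be evaluated directly from the stored quantities. The following Python sketch assumes tabular storage in plain dictionaries; the function name `projected_value`, the table layouts, and the numeric values are hypothetical and chosen only to make the recursion visible.

```python
def projected_value(task, state, policy, completion, primitive_value):
    """Evaluate the projected value of `task` in `state` via Equations 3.5 and 3.6.

    policy[task][state]              -> child chosen by the subtask policy
    completion[(task, state, child)] -> stored C value for a non-primitive subtask
    primitive_value[(action, state)] -> stored value of a primitive action
    """
    if task not in policy:  # primitive action: look up its stored value
        return primitive_value[(task, state)]
    child = policy[task][state]
    # Non-primitive: value of the chosen child plus the completion function.
    child_value = projected_value(child, state, policy, completion, primitive_value)
    return child_value + completion[(task, state, child)]


# Toy three-level chain: root -> navigate -> follow_wall (all values are made up).
policy = {"root": {"s0": "navigate"}, "navigate": {"s0": "follow_wall"}}
primitive_value = {("follow_wall", "s0"): -1.0}
completion = {
    ("navigate", "s0", "follow_wall"): -2.0,
    ("root", "s0", "navigate"): 10.0,
}
root_value = projected_value("root", "s0", policy, completion, primitive_value)
```

Here the root value unrolls as $(-1.0) + (-2.0) + 10.0 = 7.0$: one primitive value at the leaf plus one completion value per non-primitive level, which is exactly the set of quantities the decomposition needs to store.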


2 The projected value function and value function are the same for a primitive action.

Figure 3.2. This figure shows the two-part decomposition for $\hat{V}(i, s)$, the projected value
function of subtask $M_i$ for the shaded state $s$. Each circle is a state of the SMDP visited by
the agent. Subtask $M_i$ is initiated at state $s_I$ and terminates at state $s_T$. The projected
value function $\hat{V}(i, s)$ is broken into two parts: Part 1) the projected value function of
subtask $M_a$ for state $s$, and Part 2) the completion function, the expected discounted
cumulative reward of completing subtask $M_i$ after executing subtask $M_a$ in state $s$.


Andre and Russell (2002) proposed a three-part value function decomposition for achieving
hierarchical optimality. They add to the two-part value function decomposition a third
component for the expected sum of rewards outside the current subtask. This decomposition
splits the hierarchical value function of each subtask into three parts. As shown in
Figure 3.3, these three parts correspond to executing the current action (which might itself
be a subtask), completing the rest of the current subtask (up to this point the decomposition
is the same as MAXQ's), and all actions outside the current subtask.

Figure 3.3. This figure shows the three-part decomposition for $V(i, x)$, the hierarchical
value function of subtask $M_i$ for the shaded state $x = (\omega, s)$. Each circle is a state of
the SMDP visited by the agent. Subtask $M_i$ is initiated at state $x_I$ and terminates at state
$x_T$. The hierarchical value function $V(i, x)$ is broken into three parts: Part 1) the
projected value function of subtask $M_a$ for state $s$, Part 2) the completion function, the
expected discounted cumulative reward of completing subtask $M_i$ after executing subtask $M_a$
in state $s$, and Part 3) the sum of all rewards after termination of subtask $M_i$.

CHAPTER 4

HIERARCHICAL AVERAGE REWARD REINFORCEMENT LEARNING


As described in Chapter 2, the average reward formulation is more appropriate for a
wide class of continuing tasks than the better-studied discounted reward framework. A
primary goal of continuing tasks, including manufacturing, scheduling, queuing, and
inventory control, is to find a gain-optimal policy that maximizes (minimizes) the long-run
average reward (cost) over time. Although average reward reinforcement learning (RL)
has been studied using both the discrete-time MDP model (Schwartz, 1993; Mahadevan,
1996; Tadepalli and Ok, 1996; Marbach, 1998; Van-Roy, 1998) as well as the continuous-time
SMDP model (Mahadevan et al., 1997b; Wang and Mahadevan, 1999), prior work has
been limited to flat policy representations.

In this chapter,1 we extend previous work on hierarchical reinforcement learning (HRL)
to the average reward framework, and investigate two formulations of HRL based on the
average reward SMDP model. These two formulations correspond to the two notions of
optimality in HRL described in Section 3.3: hierarchical optimality and recursive optimality.
We present discrete-time and continuous-time algorithms that learn to find hierarchically
and recursively optimal average reward policies. In these algorithms, we assume that the
overall task (the root of the hierarchy) is continuing. In the hierarchically optimal average
reward RL (HAR) algorithms, the aim is to find a hierarchical policy within the space of
policies defined by the hierarchical decomposition that maximizes the global gain. In the
recursively optimal average reward RL (RAR) algorithms, we treat subtasks as continuing
average reward problems, where the goal at each subtask is to maximize its gain given
the policies of its children. We investigate the conditions under which the policy learned by
the RAR algorithm at each subtask is independent of the context in which it is executed and
therefore can be reused by other hierarchies. We use two experimental testbeds to study the
empirical performance of the proposed algorithms. The first problem is a small automated
guided vehicle (AGV) scheduling task. The second problem is a relatively large AGV
scheduling task. We model the second AGV task using both discrete-time and continuous-time
models. We compare the performance of our proposed algorithms with other HRL
methods and a flat average reward RL algorithm in this task.

1 Most of the work presented in this chapter first appeared in 1) Ghavamzadeh and Mahadevan
(2001), "Continuous-Time Hierarchical Reinforcement Learning," Proceedings of the Eighteenth
International Conference on Machine Learning, pp. 186-193, and 2) Ghavamzadeh and Mahadevan
(2002), "Hierarchically Optimal Average Reward Reinforcement Learning," Proceedings of the
Nineteenth International Conference on Machine Learning, pp. 195-202.

The rest of this chapter is organized as follows. In Section 4.1, we present discrete-time
and continuous-time hierarchically optimal average reward RL (HAR) algorithms. In
Section 4.2, we investigate different methods to formulate subtasks in a recursively optimal
average reward RL setting, and present discrete-time and continuous-time recursively
optimal average reward RL (RAR) algorithms. We demonstrate the type of optimality achieved
by HAR and RAR algorithms as well as their performance and speed compared to other
algorithms in Section 4.3. Finally, Section 4.4 summarizes the chapter and discusses some
directions for future work.

4.1  Hierarchically  Optimal  Average  Reward  RL  Algorithm 

Given the basic concepts of the average reward MDP and the average reward SMDP
models described in Sections 2.2.3 and 2.3.2, and the fundamental principles of HRL and
the HRL framework illustrated in Chapter 3, we can now proceed to describe a
hierarchically optimal average reward RL formulation. Since we are interested in hierarchical
optimality, we include the contents of the Task-Stack as a part of the state space of the
problem. In this section, we consider HRL problems for which the following assumptions
hold.


Assumption  4.1  (Continuing  Root  Task):  The  root  of  the  hierarchy  is  a  continuing  task, 
i.e.,  the  root  task  goes  on  continually  without  termination.  □ 

Assumption 4.2: For every hierarchical policy $\mu$, the single-step transition probability
matrix $P^\mu$ is unichain, that is, it consists of a single recurrent class plus a possibly
empty set of transient states. □

If Assumptions 4.1 and 4.2 hold, then using Equation 2.5, the gain2

$$g^\mu = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} (P^\mu)^t \, r(x, \mu(x)) = \bar{P}^\mu \, r(x, \mu(x)) \qquad (4.1)$$

is well defined for every hierarchical policy $\mu$ and does not depend on the initial state. We
call $g^\mu$ the global gain under the hierarchical policy $\mu$. The global gain $g^\mu$ is the gain
of the Markov chain that results from flattening the hierarchy using the hierarchical policy
$\mu$.
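
A small numerical illustration of the gain may help. The sketch below approximates the Cesaro average in Equation 4.1 for a two-state chain; the transition matrix, the reward vector, and the function name `gain_vector` are made-up illustrative choices, and the chain stands in for the flattened Markov chain induced by some fixed hierarchical policy.

```python
def gain_vector(P, r, n_steps=20000):
    """Approximate (1/N) * sum_{t=0}^{N-1} (P^t) r for a finite Markov chain.

    P is a row-stochastic matrix (list of rows) and r the per-state reward
    under the fixed policy. The result has one entry per initial state.
    """
    n = len(r)
    v = list(r)          # v holds (P^t) r, starting with t = 0
    total = [0.0] * n
    for _ in range(n_steps):
        total = [total[i] + v[i] for i in range(n)]
        v = [sum(P[i][j] * v[j] for j in range(n)) for i in range(n)]
    return [t / n_steps for t in total]


# Made-up unichain example: stationary distribution (5/6, 1/6), so the gain is 4/3.
P = [[0.9, 0.1],
     [0.5, 0.5]]
r = [1.0, 3.0]
g = gain_vector(P, r)
```

Both entries of `g` come out close to 4/3, illustrating the claim that under the unichain assumption the gain does not depend on the initial state.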

We are interested in finding a hierarchical policy $\mu^*$ which maximizes the global gain,
i.e.,

$$g^{\mu^*} \geq g^\mu \quad \text{for all } \mu \qquad (4.2)$$

We refer to a hierarchical policy $\mu^*$ which satisfies Equation 4.2 as a hierarchically
optimal average reward policy, and to $g^{\mu^*}$ as the optimal average reward or the optimal
gain.

2 Under the unichain assumption, $\bar{P}^\mu$ has equal rows. Therefore, the right-hand side of
Equation 4.1 is a vector with elements all equal to $g^\mu$.

Here we replace the value and action-value functions in the hierarchical model of Chapter
3 with the average-adjusted value and average-adjusted action-value functions described in
Sections 2.2.3 and 2.3.2.

The hierarchical average-adjusted value function for hierarchical policy $\mu$ and subtask
$M_i$, denoted $H^\mu(i, x)$, is the average-adjusted sum of rewards earned by following
hierarchical policy $\mu$ starting in state $x = (\omega, s)$ until $M_i$ terminates, plus the
expected average-adjusted reward outside subtask $M_i$.

$$H^\mu(i, x) = \lim_{N \to \infty} E\left[\sum_{k=0}^{N-1} \left(r(x_k, a_k) - g^\mu\right) \,\middle|\, x_0 = x, \mu\right] \qquad (4.3)$$

Here the rewards are adjusted with $g^\mu$, the global gain under the hierarchical policy $\mu$.

Now let us suppose that the first action chosen by $\mu_i$ is executed for a number of
primitive steps $N_1$ and terminates in state $x_1 = (\omega, s_1)$ according to the multi-step
transition probability $P_i^\mu(x_1, N_1 | x, \mu_i(x))$, and after that subtask $M_i$ itself
executes for $N_2$ steps at the level of subtask $M_i$ ($N_2$ is the number of actions taken by
subtask $M_i$, not the number of primitive actions) and terminates in state
$x_2 = (\omega, s_2)$ according to the multi-step abstract transition probability
$F_i^\mu(x_2, N_2 | x_1)$. We can re-write Equation 4.3 in the form of a Bellman equation as

$$
\begin{aligned}
H^\mu(i, x) = {} & r_i^\mu(x, \mu_i(x)) - g^\mu \, y_i^\mu(x, \mu_i(x)) \\
& + \sum_{N_1, s_1 \in S_i} P_i^\mu(x_1, N_1 | x, \mu_i(x)) \left[ \hat{H}^\mu(i, x_1) + \sum_{N_2, s_2 \in S_i} F_i^\mu(x_2, N_2 | x_1) \, H^\mu(Parent(i), (\omega \nearrow i, s_2)) \right]
\end{aligned}
\qquad (4.4)
$$

where $\hat{H}^\mu(\cdot, \cdot)$ is the projected average-adjusted value function of hierarchical
policy $\mu$ and subtask $M_i$, $y_i^\mu(x, \mu_i(x))$ is the expected number of time steps until the
next decision epoch of subtask $M_i$ after taking action $\mu_i(x)$ in state $x$ and following
hierarchical policy $\mu$ afterward, and $\omega \nearrow i$ is the content of the Task-Stack after
popping subtask $M_i$ off. Notice that $\hat{H}$ does not contain the average-adjusted rewards
outside the current subtask and should be distinguished from the hierarchical
average-adjusted value function $H$, which includes the sum of average-adjusted rewards
outside the current subtask.

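Since $r_i^\mu(x, \mu_i(x))$ is the expected reward between two decision epochs of subtask $M_i$,
given that the system occupies state $x$ at the first decision epoch and the agent chooses
action $\mu_i(x)$, we have

$$r_i^\mu(x, \mu_i(x)) = \hat{V}^\mu(\mu_i(x), (\mu_i(x) \searrow \omega, s)) = \hat{H}^\mu(\mu_i(x), (\mu_i(x) \searrow \omega, s)) + g^\mu \, y_i^\mu(x, \mu_i(x))$$

where $\mu_i(x) \searrow \omega$ is the content of the Task-Stack after pushing subtask $\mu_i(x)$
onto it. By replacing $r_i^\mu(x, \mu_i(x))$ from the above expression, Equation 4.4 can be
written as

$$
\begin{aligned}
H^\mu(i, x) = {} & \hat{H}^\mu(\mu_i(x), (\mu_i(x) \searrow \omega, s)) \\
& + \sum_{N_1, s_1 \in S_i} P_i^\mu(x_1, N_1 | x, \mu_i(x)) \left[ \hat{H}^\mu(i, x_1) + \sum_{N_2, s_2 \in S_i} F_i^\mu(x_2, N_2 | x_1) \, H^\mu(Parent(i), (\omega \nearrow i, s_2)) \right]
\end{aligned}
\qquad (4.5)
$$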

We  can  restate  Equation  4.5  for  hierarchical  average-adjusted  action-value  function  as 

( i,  x,  a)  =  i7M(a,  (a  \  u>,  s))  +  E  PtlxuN^a) 

Ni,si(zSi 

(4.6) 

H^{i,xi)  +  Ef  (x2,  N2\xi)L^{Parent{i),  (u  /  i,  s2),nparent^ (u>  /  i,s2)) 

N2,S2&Si 

From Equation 4.6, we can re-express the hierarchical average-adjusted action-value function
$L^\mu$ recursively as

\[ L^\mu(i, x, a) = \hat{H}^\mu(a, (a \searrow \omega, s)) + C^\mu(i, x, a) + CE^\mu(i, x, a) \tag{4.7} \]

where

\[ C^\mu(i, x, a) = \sum_{N_1, s_1 \in S_i} P_i^\mu(x_1, N_1 | x, a) \, \hat{H}^\mu(i, x_1) \tag{4.8} \]

and

\[ CE^\mu(i, x, a) = \sum_{N_1, s_1 \in S_i} P_i^\mu(x_1, N_1 | x, a) \sum_{N_2, s_2 \in S_i} F_i^\mu(x_2, N_2 | x_1) \, L^\mu(Parent(i), (\omega \nearrow i, s_2), \mu_{Parent(i)}(\omega \nearrow i, s_2)) \tag{4.9} \]

The term $C^\mu(i, x, a)$ is the expected average-adjusted reward of completing subtask $M_i$
after executing action $a$ in state $x = (\omega, s)$. We call this term the completion function, after
Dietterich (2000). The term $CE^\mu(i, x, a)$ is the expected average-adjusted reward received
after subtask $M_i$ terminates. We call this term the external completion function, after Andre
and Russell (2002).

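We can re-express the definition of $\hat{H}$ as

\[ \hat{H}^\mu(i, x) = \begin{cases} \hat{L}^\mu(i, x, \mu_i(x)) & \text{if } M_i \text{ is a non-primitive subtask,} \\ r(s, i) - g^\mu & \text{if } M_i \text{ is a primitive action,} \end{cases} \tag{4.10} \]

where $\hat{L}^\mu$ is the projected average-adjusted action-value function and can be written as

\[ \hat{L}^\mu(i, x, a) = \hat{H}^\mu(a, (a \searrow \omega, s)) + C^\mu(i, x, a) \tag{4.11} \]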

Equations 4.7 to 4.11 are the decomposition equations under a hierarchical policy
$\mu$. These equations recursively decompose the hierarchical average-adjusted value function
for the root, $H^\mu(0, x)$, into the projected average-adjusted value functions $\hat{H}^\mu$ for
the individual subtasks $M_1, \ldots, M_{m-1}$ in the hierarchy,$^3$ the individual completion functions
$C^\mu(i, x, a)$ for $i = 1, \ldots, m-1$, and the individual external completion functions
$CE^\mu(i, x, a)$ for $i = 1, \ldots, m-1$. The fundamental quantities that must be stored to represent
the hierarchical average-adjusted value function decomposition are the $C$ and the $CE$
values for all non-primitive subtasks, the $\hat{H}$ values for all primitive actions, and the global
gain $g$. The decomposition equations can be used to obtain update equations for $\hat{H}$, $C$,
and $CE$ in this hierarchically optimal average reward model. Pseudo-code for the discrete-time
hierarchically optimal average reward RL (HAR) algorithm is shown in Algorithm
1. In this algorithm, primitive subtasks update their projected average-adjusted value functions
$\hat{H}$ (Line 5), while non-primitive subtasks update both their completion functions $C$
(Line 17) and their external completion functions $CE$ (Lines 19 and 21). We store only one
global gain $g$ and update it after each non-random primitive action (Line 7). In the update
formula on Line 17, the projected average-adjusted value function $\hat{H}(a, (a \searrow \omega, s))$ is the
reward of executing action $a$ in state $(\omega, s)$ under subtask $M_i$ and is recursively calculated
by subtask $M_a$ and its descendants using Equations 4.10 and 4.11. Notice that the hierarchical
average-adjusted action-value function $L$ on Lines 15 and 19 is recursively evaluated
using Equation 4.7.
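As an illustration of how Equations 4.7, 4.10, and 4.11 fit together, the following Python sketch evaluates $\hat{H}$, $\hat{L}$, and $L$ recursively from stored tables. The class name `HARTables`, the dictionary layout, and the representation of the Task-Stack as a tuple with the most recently pushed subtask first are assumptions made for this example only; a fixed policy table stands in for $\mu_i$.

```python
from collections import defaultdict


class HARTables:
    """Hypothetical container for the quantities HAR stores (illustrative only)."""

    def __init__(self, children, policy):
        self.children = children          # non-primitive subtask -> list of child actions
        self.policy = policy              # (subtask, x) -> action chosen by mu_i in x
        self.H_prim = defaultdict(float)  # (primitive action, x) -> projected value
        self.C = defaultdict(float)       # (subtask, x, action) -> completion value
        self.CE = defaultdict(float)      # (subtask, x, action) -> external completion

    def is_primitive(self, i):
        return i not in self.children

    def projected_H(self, i, x):
        """Equation 4.10: stored value for primitives, projected L otherwise."""
        if self.is_primitive(i):
            return self.H_prim[(i, x)]
        return self.projected_L(i, x, self.policy[(i, x)])

    def projected_L(self, i, x, a):
        """Equation 4.11: value of child a (with a pushed on the stack) plus C."""
        stack, s = x
        child_x = ((a,) + stack, s)       # push a onto the Task-Stack
        return self.projected_H(a, child_x) + self.C[(i, x, a)]

    def L(self, i, x, a):
        """Equation 4.7: projected L plus the external completion term CE."""
        return self.projected_L(i, x, a) + self.CE[(i, x, a)]


# Tiny hierarchy: root 0 -> subtask 1 -> primitive action "p".
tables = HARTables(children={0: [1], 1: ["p"]}, policy={})
x = ((1, 0), "s")                         # subtask 1 is on top of the stack
tables.policy[(1, x)] = "p"
tables.H_prim[("p", (("p", 1, 0), "s"))] = 2.0
tables.C[(1, x, "p")] = 0.5
tables.CE[(1, x, "p")] = 0.25
```

With these made-up table entries, `tables.projected_H(1, x)` combines the primitive value and the completion value, and `tables.L(1, x, "p")` adds the external completion value on top, mirroring the three-part decomposition.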

This algorithm can be easily extended to continuous time by changing the update formulas
for $\hat{H}$ and $g$ on Lines 5 and 7 to

\[ \hat{H}_{t+1}(i, x) \leftarrow [1 - \alpha_t(i)] \hat{H}_t(i, x) + \alpha_t(i) \left[ k(s, i) + r(s, i)\tau(s, i) - g_t \tau(s, i) \right] \]

\[ g_{t+1} = \frac{r_{t+1}}{t_{t+1}} = \frac{r_t + k(s, i) + r(s, i)\tau(s, i)}{t_t + \tau(s, i)} \]

$^3$ $m$ is the total number of subtasks in the hierarchy.
Algorithm 1 Discrete-time hierarchically optimal average reward RL (HAR) algorithm.

 1: Function HAR(Task $M_i$, State $x = (\omega, s)$)
 2:   let Seq = {} be the sequence of states visited while executing subtask $M_i$
 3:   if $M_i$ is a primitive action then
 4:     execute action $i$ in state $x$, observe state $x' = (\omega, s')$ and reward $r(s, i)$
 5:     $\hat{H}_{t+1}(i, x) \leftarrow [1 - \alpha_t(i)] \hat{H}_t(i, x) + \alpha_t(i) [r(s, i) - g_t]$
 6:     if $M_i$ and all its ancestors are non-random actions then
 7:       update the global gain $g_{t+1} = \frac{r_{t+1}}{n_{t+1}} = \frac{r_t + r(s, i)}{n_t + 1}$
 8:     end if
 9:     push state $x_1 = (\omega \nearrow i, s)$ into the beginning of Seq
10:   else /* $M_i$ is a non-primitive subtask */
11:     while $M_i$ has not terminated do
12:       choose action (subtask) $a$ according to the current exploration policy $\mu_i(x)$
13:       let ChildSeq = HAR($M_a$, $(a \searrow \omega, s)$), where ChildSeq is the sequence of states
          visited while executing subtask $M_a$
14:       observe result state $x' = (\omega, s')$
15:       let $a^* = \arg\max_{a' \in A_i(s')} L_t(i, x', a')$
16:       for each $x = (\omega, s)$ in ChildSeq from the beginning do
17:         $C_{t+1}(i, x, a) \leftarrow [1 - \alpha_t(i)] C_t(i, x, a) + \alpha_t(i) \left[ \hat{H}_t(a^*, (a^* \searrow \omega, s')) + C_t(i, x', a^*) \right]$
18:         if $s' \in T_i$ ($s'$ belongs to $T_i$, the set of terminal states of subtask $M_i$) then
19:           $CE_{t+1}(i, x, a) \leftarrow [1 - \alpha_t(i)] CE_t(i, x, a) + \alpha_t(i) L_t(Parent(i), (\omega \nearrow i, s'), a^*)$
20:         else /* $s'$ is not a terminal state of subtask $M_i$ */
21:           $CE_{t+1}(i, x, a) \leftarrow [1 - \alpha_t(i)] CE_t(i, x, a) + \alpha_t(i) CE_t(i, x', a^*)$
22:         end if
23:         replace state $x = (\omega, s)$ with $(\omega \nearrow i, s)$ in the ChildSeq
24:       end for
25:       append ChildSeq onto the front of Seq
26:       $x = x'$
27:     end while
28:   end if
29:   return Seq
30: end HAR
In the continuous-time update formulas, $\tau(s, i)$ is the time elapsing between state $s$ and the
next state, $k(s, i)$ is the fixed reward of taking action $i$ in state $s$, and $r(s, i)$ is the reward
rate for the time between state $s$ and the next state.
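The continuous-time updates of $\hat{H}$ and the global gain can be sketched in Python as follows. The function name, the dictionary-based storage, and the `totals` bookkeeping layout are assumptions for illustration; the sketch also omits the check that the action was chosen non-randomly before the gain is updated.

```python
def har_continuous_update(H, totals, i, x, k, rate, tau, alpha):
    """One continuous-time update of H (Line 5) and the global gain (Line 7).

    H      : dict mapping (primitive action, state) -> projected value
    totals : dict with keys "reward", "time", "gain" (hypothetical layout)
    k      : fixed reward, rate: reward rate, tau: elapsed time (tau > 0)
    """
    reward = k + rate * tau
    old = H.get((i, x), 0.0)
    # The value update uses the gain estimate from before this step (g_t).
    H[(i, x)] = (1 - alpha) * old + alpha * (reward - totals["gain"] * tau)
    # Gain = accumulated reward divided by accumulated elapsed time.
    totals["reward"] += reward
    totals["time"] += tau
    totals["gain"] = totals["reward"] / totals["time"]
    return H[(i, x)], totals["gain"]


H = {}
totals = {"reward": 0.0, "time": 0.0, "gain": 0.0}
first = har_continuous_update(H, totals, "move", "s0", k=1.0, rate=2.0, tau=0.5, alpha=0.5)
second = har_continuous_update(H, totals, "move", "s0", k=1.0, rate=2.0, tau=0.5, alpha=0.5)
```

Setting `tau = 1` and `k = 0` recovers the discrete-time form, in which the gain is the accumulated reward divided by the number of steps.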

4.2  Recursively  Optimal  Average  Reward  RL 

In the previous section, we introduced discrete-time and continuous-time hierarchically
optimal average reward RL (HAR) algorithms. In the HAR algorithm, we define only a global
gain for the entire hierarchy to guarantee hierarchical optimality for the overall task. The
HAR algorithm finds a hierarchical policy that has the highest global gain among all policies
consistent with the given hierarchy. However, there may exist subtasks whose
policies must be locally suboptimal so that the overall policy becomes optimal. Recursive
optimality is a kind of local optimality in which the policy at each node is optimal given
the policies of its children (see Section 3.3). Thus, the goal at root is to maximize its gain
given the policies for its descendants. The reason to seek recursive optimality rather than
hierarchical optimality is that recursive optimality makes it possible to solve each subtask
without reference to the context in which it is executed, and therefore the learned subtask
can be reused by other hierarchies. This leaves open the question of what local optimality
criterion should be used for each subtask in a recursively optimal average reward RL
setting.

One approach pursued by Seri and Tadepalli (2002) is to optimize subtasks using their
expected total average-adjusted reward with respect to the global gain. Seri and Tadepalli
introduced a model-based algorithm called Hierarchical H-Learning (HH-Learning). For
every subtask, this algorithm learns the action model and maximizes the expected total
average-adjusted reward with respect to the global gain at each state. In this method, the
projected average-adjusted value functions with respect to the global gain satisfy the following
equations:

\[ \hat{H}^\mu(i, s) = \begin{cases} r(s, i) - g^\mu & \text{if } M_i \text{ is a primitive action,} \\ 0 & \text{if } s \text{ is a terminal state of subtask } M_i, \\ \max_{a \in A_i(s)} \left[ \hat{H}^\mu(a, s) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, a) \, \hat{H}^\mu(i, s') \right] & \text{otherwise.} \end{cases} \tag{4.12} \]


The first term of the last part of Equation 4.12, $\hat{H}^\mu(a, s)$, denotes the expected total
average-adjusted reward during the execution of subtask $M_a$ (the projected average-adjusted value
function of subtask $M_a$), and the second term denotes the expected total average-adjusted
reward from then on until the completion of subtask $M_i$ (the completion function of subtask
$M_i$ after execution of subtask $M_a$). Since the expected average-adjusted reward after
execution of subtask $M_i$ is not a component of the average-adjusted value function of subtask
$M_i$, this approach does not necessarily allow for hierarchical optimality, as we will
show in the experiments of Section 4.3. Moreover, the policy learned for each subtask
using this approach is not context free, since each subtask maximizes its average-adjusted
reward with respect to the global gain. However, Seri and Tadepalli (2002) showed that this
method finds the hierarchically optimal average reward policy when the result distribution
invariance (RDI) condition holds.


Definition 4.1 (Result Distribution Invariance (RDI) Condition): For all subtasks $M_i$
and states $s$ in the hierarchy, the distribution of states reached after the execution of any
subtask $M_a$ ($M_a$ is one of $M_i$'s children) is independent of the policy of subtask $M_a$, $\mu_a$,
and the policies of $M_a$'s descendants, i.e., $P_i^\mu(s'|s, a) = P_i(s'|s, a)$. □

In other words, states reached after the execution of a subtask cannot be changed by altering
the policies of the subtask and its descendants. Note that the RDI condition does not
hold for every problem, and therefore the HH-Learning algorithm is neither hierarchically
nor recursively optimal in general.
Another approach is to formulate subtasks as continuing average reward problems,
where the goal at each subtask is to maximize its gain given the policies of its children
(Ghavamzadeh and Mahadevan, 2001). We first describe this approach in detail in Sections
4.2.1 and 4.2.2. In Section 4.2.3, we use this method to find recursively optimal average
reward policies, and present discrete-time and continuous-time recursively optimal average
reward RL (RAR) algorithms. Finally, in Section 4.2.4, we investigate the conditions
under which the policy learned by the RAR algorithm at each subtask is independent of the
context in which it is executed and therefore can be reused by other hierarchies.

4.2.1  Root  Task  Formulation 

In our approach, we consider those problems for which Assumption 4.1 (Continuing
Root Task) and the following assumption hold.

Assumption 4.3 (Root Task Recurrence): There exists a state $s_0^* \in S_0$ such that, for
every hierarchical policy $\mu$ and for every state $s \in S_0$, we have$^4$

\[ \sum_{N=1}^{|S_0|} F_0^\mu(s_0^*, N | s) > 0 \]

where $F_0^\mu$ is the multi-step abstract transition probability function of root under the hierarchical
policy $\mu$ described in Section 3.2, and $|S_0|$ is the number of states in the state space
of root. □

$^4$ Notice that the root task is represented as subtask $M_0$ in the HRL framework described in Chapter 3. So
we use index 0 to represent every component of the root task.

Assumption 4.3 is equivalent to assuming that the underlying Markov chain at root
for every hierarchical policy $\mu$ has a single recurrent class, and state $s_0^*$ is a recurrent state.
The recurrent state $s_0^*$ can be a terminal state of any of root's children. If Assumptions 4.1
and 4.3 hold, the gain at the root task under the hierarchical policy $\mu$, $g_0^\mu$, is well defined
for every hierarchical policy $\mu$ and does not depend on the initial state. When the state
space at root is finite or countable, the average reward or gain at root can be written as

\[ g_0^\mu = \frac{\bar{m}_0^\mu \, r_0^\mu(s, \mu_0(s))}{\bar{m}_0^\mu \, y_0^\mu(s, \mu_0(s))} \]

where $r_0^\mu(s, \mu_0(s))$ and $y_0^\mu(s, \mu_0(s))$ denote the expected total reward and the expected
number of time steps between two decision epochs at root, given that the system occupies
state $s$ at the first decision epoch and the agent chooses its actions according to the hierarchical
policy $\mu$. The terms $m_0^\mu$ and $\bar{m}_0^\mu = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} (m_0^\mu)^n$ are the transition
probability matrix and the limiting matrix of the embedded Markov chain at root for the
hierarchical policy $\mu$, respectively. The transition probability $m_0^\mu$ is obtained by marginalizing
the multi-step abstract transition probability $F_0^\mu$. The term $m_0^\mu(s'|s, \mu_0(s))$ denotes
the probability that the SMDP at root occupies state $s'$ at the next decision epoch, given
that the agent chooses action $\mu_0(s)$ in state $s$ at the current decision epoch and follows the
hierarchical policy $\mu$.
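The ratio formula for the gain at root can be illustrated numerically. The following Python sketch assumes a finite embedded chain with a single recurrent class, in which case every row of the limiting matrix equals the stationary distribution. The function names and the example matrices are hypothetical, and iterating the "lazy" chain $(m + I)/2$ is a convenience chosen here to handle periodic chains, not a step taken from the text.

```python
def stationary_distribution(m, iterations=200):
    """Stationary distribution of a chain with a single recurrent class.

    m is a row-stochastic matrix given as a list of lists. The lazy chain
    (m + I) / 2 has the same stationary distribution as m but is aperiodic,
    so repeated multiplication converges even when m itself oscillates.
    """
    n = len(m)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [0.5 * pi[j] for j in range(n)]        # the I / 2 part
        for i in range(n):
            for j in range(n):
                nxt[j] += 0.5 * pi[i] * m[i][j]      # the m / 2 part
        pi = nxt
    return pi


def root_gain(m, r, y):
    """Gain = (stationary expected reward) / (stationary expected duration)."""
    pi = stationary_distribution(m)
    expected_reward = sum(p * ri for p, ri in zip(pi, r))
    expected_steps = sum(p * yi for p, yi in zip(pi, y))
    return expected_reward / expected_steps


# Two-state periodic example: rewards 3 and 1, durations 1 and 2 time steps.
m_example = [[0.0, 1.0], [1.0, 0.0]]
gain_example = root_gain(m_example, r=[3.0, 1.0], y=[1.0, 2.0])

# A non-symmetric chain whose stationary distribution is (2/3, 1/3).
pi_example = stationary_distribution([[0.5, 0.5], [1.0, 0.0]])
```

In the first example the chain earns a total reward of 4 over every 3 time steps, so the gain is 4/3 regardless of the starting state.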

4.2.2  Subtask  Formulation 

In Section 4.2.1, we described the average reward formulation of the root task of a hierarchical
decomposition. In this section, we illustrate how we formulate all other subtasks
in a hierarchy as average reward problems. From now on in this chapter, we use subtask to
refer to non-primitive subtasks in a hierarchy except root.

In HRL methods, we typically assume that every time a subtask $M_i$ is executed, it
starts at one of its initial states ($\in \mathcal{I}_i$) and terminates at one of its terminal states ($\in T_i$) after
a finite number of steps. Therefore, we can make the following assumption for every subtask
$M_i$ in a hierarchy. Under this assumption, each subtask can be considered an episodic
problem and each instantiation of a subtask can be considered an episode.
Assumption 4.4 (Subtask Termination): There exists a dummy state $s_i^* \notin S_i$ such that,
for every action $a \in A_i$ and every terminal state $s_{T_i} \in T_i$, we have

\[ r_i(s_{T_i}, a) = 0 \quad \text{and} \quad P_i(s_i^*, 1 | s_{T_i}, a) = 1 \]

and for all hierarchical stationary policies $\mu$ and non-terminal states $s \in S_i$, we have

\[ F_i^\mu(s_i^*, 1 | s) = 0 \]

and finally for all states $s \in S_i$, we have

\[ F_i^\mu(s_i^*, N | s) > 0 \]

where $F_i^\mu$ is the multi-step abstract transition probability function of subtask $M_i$ under the
hierarchical policy $\mu$ described in Section 3.2, and $N = |S_i|$ is the number of states in the
state space of subtask $M_i$. □

Although subtasks are episodic problems, when the overall task (root of the hierarchy)
is continuing, as we assume in this chapter (Assumption 4.1), they are executed an infinite
number of times, and therefore can be modeled as continuing problems using the model
described in Figure 4.1. In this model, each subtask $M_i$ terminates at one of its terminal
states $s_{T_i} \in T_i$. All terminal states transit with probability 1 and reward 0 to a dummy
state $s_i^*$. This is a dummy transition: it does not add a time step to the cycle of subtask
$M_i$ and therefore is not taken into consideration when the average reward of subtask $M_i$ is
calculated. Finally, the dummy state $s_i^*$ transits with reward zero to one of the initial states
($\in \mathcal{I}_i$) of subtask $M_i$ upon the next instantiation of $M_i$. It is important for the validity of
this model to fix the value of dummy states to zero.
[Figure 4.1 diagram labels: Set of Terminal States $T_i$; Set of Initial States $\mathcal{I}_i$, entered with probabilities $I_1 + \ldots + I_n = 1$.]

Figure 4.1. This figure shows how each subtask in a hierarchical decomposition of a continuing
problem can be modeled as a continuing task.


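Under this model, for every hierarchical policy $\mu$, each subtask $M_i$ in the hierarchy can
be modeled using a new MDP with abstract transition probabilities and rewards

\[ F_{\mathcal{I}_i}^\mu(s', 1 | s) = \begin{cases} F_i^\mu(s', 1 | s) & s \neq s_i^* \\ \mathcal{I}_i(s') & s = s_i^* \end{cases} \qquad r_{\mathcal{I}_i}^\mu(s, a) = r_i^\mu(s, a) \tag{4.13} \]

where $\mathcal{I}_i(s)$ is the probability that subtask $M_i$ starts at state $s$.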

Let $\mathcal{F}_{\mathcal{I}_i}$ be the set of all abstract transition probability functions $F_{\mathcal{I}_i}^\mu$. We have the following
result for subtask $M_i$.

Lemma 4.1: Let Assumption 4.4 (Subtask Termination) hold. Then for every $F_{\mathcal{I}_i}^\mu \in \mathcal{F}_{\mathcal{I}_i}$
and every state $s \in S_i$, we have $\sum_{N=1}^{|S_i|} F_{\mathcal{I}_i}^\mu(s_i^*, N | s) > 0$.$^5$ □

$^5$ This lemma is a restatement of Lemma 5 on page 34 of Peter Marbach's thesis (Marbach, 1998).

Lemma 4.1 is equivalent to assuming that for every subtask $M_i$ in the hierarchy, the underlying
Markov chain for every hierarchical policy $\mu$ has a single recurrent class and state
$s_i^*$ is its recurrent state. Under this model, the gain of subtask $M_i$ under the hierarchical
policy $\mu$, $g_i^\mu$, is well defined for every hierarchical policy $\mu$ and does not depend on the
initial state. When the state space of subtask $M_i$ is finite or countable, the gain of subtask
$M_i$ can be written as

\[ g_i^\mu = \frac{\bar{m}_{\mathcal{I}_i}^\mu \, r_{\mathcal{I}_i}^\mu(s, \mu_i(s))}{\bar{m}_{\mathcal{I}_i}^\mu \, y_{\mathcal{I}_i}^\mu(s, \mu_i(s))} \]

where $r_{\mathcal{I}_i}^\mu(s, \mu_i(s))$ and $y_{\mathcal{I}_i}^\mu(s, \mu_i(s))$ are equal to $r_i^\mu(s, \mu_i(s))$ and $y_i^\mu(s, \mu_i(s))$, and denote
the expected total reward and the expected number of time steps between two decision
epochs of subtask $M_i$, given that the system occupies state $s$ at the first decision epoch
and the agent chooses its actions according to the hierarchical policy $\mu$. The terms $m_{\mathcal{I}_i}^\mu$
and $\bar{m}_{\mathcal{I}_i}^\mu = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N-1} (m_{\mathcal{I}_i}^\mu)^n$ are the transition probability matrix and the limiting
matrix of the Markov chain$^6$ at subtask $M_i$ for the hierarchical policy $\mu$, respectively. The
transition probability $m_{\mathcal{I}_i}^\mu$ is obtained by marginalizing the multi-step abstract transition
probability $F_{\mathcal{I}_i}^\mu$.
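Because the chain in the continuing-subtask model returns to the dummy state after every instantiation, the gain of a subtask can also be obtained by a renewal-reward argument: expected total reward per instantiation divided by expected number of time steps per instantiation. The Python sketch below takes this route as an illustrative alternative to the ratio formula; the function names, the dictionary encoding of the transition function, and the three-state example are all hypothetical. It treats the dummy transitions as contributing neither reward nor time.

```python
def expected_to_termination(F, per_step, terminal, sweeps=100):
    """Expected accumulated per-step quantity from each state until termination.

    F        : dict state -> dict of next_state -> probability (fixed policy)
    per_step : dict state -> reward (or duration) of the decision made there
    terminal : set of terminal states, whose accumulated value stays zero
    """
    total = {s: 0.0 for s in F}
    for _ in range(sweeps):
        for s in F:
            if s in terminal:
                continue
            total[s] = per_step[s] + sum(p * total[s2] for s2, p in F[s].items())
    return total


def subtask_gain(F, r, y, terminal, initial_dist):
    """Reward per instantiation divided by time per instantiation."""
    R = expected_to_termination(F, r, terminal)
    Y = expected_to_termination(F, y, terminal)
    reward = sum(p * R[s] for s, p in initial_dist.items())
    steps = sum(p * Y[s] for s, p in initial_dist.items())
    return reward / steps


# Hypothetical subtask: a -> b -> T (terminal), one time step per decision.
F = {"a": {"b": 1.0}, "b": {"T": 1.0}, "T": {}}
r = {"a": 1.0, "b": 2.0}
y = {"a": 1.0, "b": 1.0}
gain_from_a = subtask_gain(F, r, y, {"T"}, {"a": 1.0})
gain_from_b = subtask_gain(F, r, y, {"T"}, {"b": 1.0})
```

The two calls use the same subtask policy but different initial state distributions and give different gains (1.5 and 2.0), which shows how the gain of the model in Equation 4.13 depends on where the subtask is started.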

4.2.3  Recursively  Optimal  Average  Reward  RL  Algorithm 

In this section, we present discrete-time and continuous-time recursively optimal average
reward RL (RAR) algorithms using the formulation described in Sections 4.2.1 and
4.2.2. We consider problems for which Assumptions 4.1, 4.3, and 4.4 (Continuing Root
Task, Root Task Recurrence, and Subtask Termination) hold, root is modeled as an average
reward problem as described in Section 4.2.1, and every other non-primitive subtask in the
hierarchy is modeled as an average reward problem using the model described in Section
4.2.2. Under these assumptions, the average reward for every non-primitive subtask in the
hierarchy including root is well defined for every hierarchical policy and does not vary with
the initial state. Since we are interested in finding a recursively optimal average reward policy,
we do not need to include the contents of the Task-Stack as a part of the state space of the
problem. We also replace the projected value and action-value functions in the hierarchical
model of Chapter 3 with the projected average-adjusted value and projected average-adjusted
action-value functions described in Sections 2.2.3 and 2.3.2.

$^6$ This Markov chain corresponds to the MDP at subtask $M_i$ defined by Equation 4.13, not the original
MDP at subtask $M_i$ defined by $F_i^\mu$ and $r_i^\mu$.

We show how the overall projected average-adjusted value function $\hat{H}^\mu(0, s)$ is decomposed
into a collection of projected average-adjusted value functions of individual subtasks
$\hat{H}^\mu(i, s)$ for $i = 1, \ldots, m-1$, in the RAR algorithm. The projected average-adjusted value
function of hierarchical policy $\mu$ on subtask $M_i$ is the average-adjusted (with respect to the
local gain $g_i^\mu$) sum of rewards earned by following policy $\mu_i$ and the policies of all descendants
of subtask $M_i$ starting in state $s$ until subtask $M_i$ terminates. Now let us suppose that
the first action chosen by $\mu_i$ is invoked and executed for a number of primitive steps $N$ and
terminates in state $s'$ according to the multi-step transition probability $P_i^\mu(s', N | s, \mu_i(s))$. We
can write the projected average-adjusted value function in the form of a Bellman equation
as

\[ \hat{H}^\mu(i, s) = r_i^\mu(s, \mu_i(s)) - g_i^\mu y_i^\mu(s, \mu_i(s)) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, \mu_i(s)) \, \hat{H}^\mu(i, s') \tag{4.14} \]

Since the term $r_i^\mu(s, \mu_i(s))$ is the expected total reward between two decision epochs
of subtask $M_i$, given that the system occupies state $s$ at the first decision epoch, the agent
chooses action $\mu_i(s)$, and the number of time steps until the next decision epoch is defined
by $y_i^\mu(s, \mu_i(s))$, we have

\[ r_i^\mu(s, \mu_i(s)) = \begin{cases} \hat{V}^\mu(\mu_i(s), s) = \hat{H}^\mu(\mu_i(s), s) + g_{\mu_i(s)}^\mu y_i^\mu(s, \mu_i(s)) & \text{if } \mu_i(s) \text{ is a non-primitive subtask,} \\ \hat{V}^\mu(\mu_i(s), s) & \text{if } \mu_i(s) \text{ is a primitive action.} \end{cases} \]
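By replacing $r_i^\mu(s, \mu_i(s))$ from the above expression, Equation 4.14 can be written as

\[ \hat{H}^\mu(i, s) = \begin{cases} \hat{H}^\mu(\mu_i(s), s) - (g_i^\mu - g_{\mu_i(s)}^\mu) y_i^\mu(s, \mu_i(s)) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, \mu_i(s)) \, \hat{H}^\mu(i, s') & \text{if } \mu_i(s) \text{ is a non-primitive subtask,} \\ \hat{V}^\mu(\mu_i(s), s) - g_i^\mu y_i^\mu(s, \mu_i(s)) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, \mu_i(s)) \, \hat{H}^\mu(i, s') & \text{if } \mu_i(s) \text{ is a primitive action.} \end{cases} \tag{4.15} \]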


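We can restate Equation 4.15 for the projected action-value function as follows:

\[ \hat{L}^\mu(i, s, a) = \begin{cases} \hat{H}^\mu(a, s) - (g_i^\mu - g_a^\mu) y_i^\mu(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, a) \, \hat{L}^\mu(i, s', \mu_i(s')) & \text{if } a \text{ is a non-primitive subtask,} \\ \hat{V}^\mu(a, s) - g_i^\mu y_i^\mu(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, a) \, \hat{L}^\mu(i, s', \mu_i(s')) & \text{if } a \text{ is a primitive action.} \end{cases} \tag{4.16} \]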


In the above equation, when $a$ is a non-primitive subtask, the term

\[ -(g_i^\mu - g_a^\mu) y_i^\mu(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, a) \, \hat{L}^\mu(i, s', \mu_i(s')) \]

and when $a$ is a primitive action, the term

\[ -g_i^\mu y_i^\mu(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, a) \, \hat{L}^\mu(i, s', \mu_i(s')) \]

denote the average-adjusted reward of completing subtask $M_i$ after executing action $a$ in
state $s$. We call this term the completion function, after Dietterich (2000), and denote it by
$C^\mu(i, s, a)$. With this definition, we can express the average-adjusted action-value function
$\hat{L}^\mu$ recursively as

\[ \hat{L}^\mu(i, s, a) = \begin{cases} \hat{H}^\mu(a, s) + C^\mu(i, s, a) & \text{if } a \text{ is a non-primitive subtask,} \\ \hat{V}^\mu(a, s) + C^\mu(i, s, a) & \text{if } a \text{ is a primitive action,} \end{cases} \tag{4.17} \]

where

\[ C^\mu(i, s, a) = \begin{cases} -(g_i^\mu - g_a^\mu) y_i^\mu(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, a) \, \hat{L}^\mu(i, s', \mu_i(s')) & \text{if } a \text{ is a non-primitive subtask,} \\ -g_i^\mu y_i^\mu(s, a) + \sum_{N, s' \in S_i} P_i^\mu(s', N | s, a) \, \hat{L}^\mu(i, s', \mu_i(s')) & \text{if } a \text{ is a primitive action,} \end{cases} \tag{4.18} \]

and

\[ \hat{H}^\mu(i, s) = \hat{L}^\mu(i, s, \mu_i(s)) \tag{4.19} \]

when $M_i$ is a non-primitive subtask.

Equations 4.15 to 4.19 are the decomposition equations for the projected average-adjusted
value and projected average-adjusted action-value functions. They can be used to obtain update
formulas for $\hat{H}$ and $C$ in this recursively optimal average reward model. Pseudo-code
for the discrete-time recursively optimal average reward RL (RAR) algorithm is shown in
Algorithm 2. In this algorithm, a gain is defined for every non-primitive subtask in the
hierarchy, and this gain is updated every time a subtask is non-randomly chosen. Primitive
subtasks store their projected value functions and update them using the equation
on Line 5. Non-primitive subtasks store their completion functions and gains, and update
them using the equations on Lines 17, 19, and 23. The projected average-adjusted action-value
function $\hat{L}$ on Lines 12, 17, and 19 is recursively calculated using Equations 4.17 to 4.19.
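The recursive evaluation described by Equations 4.17 to 4.19 can be sketched in Python as below. The task graph, the table names `V` and `C`, and the numeric entries are hypothetical, and the sketch assumes a greedy policy (the action maximizing $\hat{L}$, as on Line 12 of Algorithm 2) in place of a general $\mu_i$.

```python
from collections import defaultdict

# Hypothetical task graph: root 0 has children subtask 1 and primitive "left";
# subtask 1 has primitive children "pick" and "drop".
children = {0: [1, "left"], 1: ["pick", "drop"]}
V = defaultdict(float)   # (primitive action, state) -> projected value
C = defaultdict(float)   # (subtask, state, action) -> completion value


def is_primitive(a):
    return a not in children


def L_hat(i, s, a):
    """Equation 4.17: value of the child plus the completion function."""
    child_value = V[(a, s)] if is_primitive(a) else H_hat(a, s)
    return child_value + C[(i, s, a)]


def greedy_action(i, s):
    """Action of subtask i that maximizes the projected action value in s."""
    return max(children[i], key=lambda a: L_hat(i, s, a))


def H_hat(i, s):
    """Equation 4.19, with the greedy action standing in for mu_i(s)."""
    return L_hat(i, s, greedy_action(i, s))


# Made-up table entries for a single state "s".
V[("pick", "s")] = 1.0
V[("drop", "s")] = 0.0
V[("left", "s")] = -1.0
C[(1, "s", "pick")] = 0.5
C[(1, "s", "drop")] = 2.0
C[(0, "s", 1)] = 0.25
C[(0, "s", "left")] = 4.0
```

Only the primitive values and the completion values are stored; the value of every non-primitive subtask is recomputed on demand by descending the task graph.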
Algorithm 2 Discrete-time recursively optimal average reward RL (RAR) algorithm.

 1: Function RAR(Task $M_i$, State $s$)
 2:   let Seq = {} be the sequence of (state visited, reward) while executing subtask $M_i$
 3:   if $M_i$ is a primitive action then
 4:     execute action $i$ in state $s$, observe state $s'$ and reward $r(s, i)$
 5:     $\hat{V}_{t+1}(i, s) \leftarrow [1 - \alpha_t(i)] \hat{V}_t(i, s) + \alpha_t(i) r(s, i)$
 6:     push (state $s$, reward $r(s, i)$) into the beginning of Seq
 7:   else /* $M_i$ is a non-primitive subtask */
 8:     while $M_i$ has not terminated do
 9:       choose action (subtask) $a$ according to the current exploration policy $\mu_i(s)$
10:       let ChildSeq = RAR($M_a$, $s$), where ChildSeq is the sequence of (state visited, reward)
          while executing subtask $M_a$
11:       observe result state $s'$
12:       let $a^* = \arg\max_{a' \in A_i(s')} \hat{L}_t(i, s', a')$
13:       let $N = 0$; $\rho = 0$;
14:       for each $(s, r)$ in ChildSeq from the beginning do
15:         $N = N + 1$; $\rho = \rho + r$;
16:         if $a$ is a primitive action then
17:           $C_{t+1}(i, s, a) \leftarrow [1 - \alpha_t(i)] C_t(i, s, a) + \alpha_t(i) [\hat{L}_t(i, s', a^*) - g_t(i) N]$
18:         else /* $M_a$ is a non-primitive subtask */
19:           $C_{t+1}(i, s, a) \leftarrow [1 - \alpha_t(i)] C_t(i, s, a) + \alpha_t(i) \{ \hat{L}_t(i, s', a^*) - [g_t(i) - g_t(a)] N \}$
20:         end if
21:       end for
22:       if $M_a$ and all its ancestors are non-random actions then
23:         update gain of subtask $M_i$: $g_{t+1}(i) = \frac{r_{t+1}(i)}{n_{t+1}(i)} = \frac{r_t(i) + \rho}{n_t(i) + N}$
24:       end if
25:       append ChildSeq onto the front of Seq
26:       $s = s'$
27:     end while
28:   end if
29:   return Seq
30: end RAR
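The inner loop of Algorithm 2 (Lines 13 to 23) can be sketched in Python as follows. The function names, the dictionary storage, and the `stats` bookkeeping are assumptions for illustration. The gain update divides accumulated reward by accumulated step count, which is the discrete-time counterpart of the continuous-time gain formula and is assumed here rather than quoted.

```python
def update_completion(C, gain, i, a, child_seq, L_next, alpha, child_is_primitive):
    """Lines 13-21: update C(i, s, a) for every state visited by child a.

    child_seq lists (state, reward) pairs with the most recent visit first,
    so the n-th entry lies n primitive steps before the result state s'.
    L_next is the value L_t(i, s', a*) computed on Line 12.
    """
    n, rho = 0, 0.0
    for s, r in child_seq:
        n += 1
        rho += r
        # Primitive children are adjusted by g(i); subtasks by g(i) - g(a).
        adjust = gain[i] if child_is_primitive else gain[i] - gain[a]
        target = L_next - adjust * n
        C[(i, s, a)] = (1 - alpha) * C.get((i, s, a), 0.0) + alpha * target
    return n, rho


def update_gain(stats, gain, i, n, rho):
    """Line 23: gain = accumulated reward / accumulated number of steps."""
    stats[i][0] += rho
    stats[i][1] += n
    gain[i] = stats[i][0] / stats[i][1]
    return gain[i]


C = {}
gain = {"i": 0.5, "a": 0.25}
stats = {"i": [0.0, 0]}
steps, reward = update_completion(
    C, gain, "i", "a", [("s2", 1.0), ("s1", 3.0)],
    L_next=2.0, alpha=0.5, child_is_primitive=False)
new_gain = update_gain(stats, gain, "i", steps, reward)
```

The completion updates all use the gain estimates from before the gain update, matching the order of Lines 17 to 23.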
This algorithm can be easily extended to continuous time. In the continuous-time version
of the RAR algorithm, in addition to the visited state and reward, we need to push the
execution time $\tau$ of primitive actions into the Seq. Therefore $N = N + 1$ on Line 15 of the
algorithm is replaced by $T = T + \tau$. We also need to modify the update formulas for $\hat{V}$,
$C$, and $g_t$ on Lines 5, 17, 19, and 23 as

\[ \hat{V}_{t+1}(i, s) \leftarrow [1 - \alpha_t(i)] \hat{V}_t(i, s) + \alpha_t(i) \left[ k(s, i) + r(s, i)\tau(s, i) \right] \]

\[ C_{t+1}(i, s, a) \leftarrow [1 - \alpha_t(i)] C_t(i, s, a) + \alpha_t(i) \left[ \hat{L}_t(i, s', a^*) - g_t(i) T \right] \]

\[ C_{t+1}(i, s, a) \leftarrow [1 - \alpha_t(i)] C_t(i, s, a) + \alpha_t(i) \left[ \hat{L}_t(i, s', a^*) - (g_t(i) - g_t(a)) T \right] \]

\[ g_{t+1}(i) = \frac{r_{t+1}(i)}{t_{t+1}(i)} = \frac{r_t(i) + \rho}{t_t(i) + T} \]

where $\tau(s, i)$ is the time elapsing between state $s$ and the next state, $k(s, i)$ is the fixed
reward of taking action $i$ in state $s$, and $r(s, i)$ is the reward rate for the time between state
$s$ and the next state.


4.2.4  Optimality  of  the  RAR  Algorithm 

In this section, we investigate the optimality achieved by the RAR algorithm. In the
RAR algorithm, since the expected average-adjusted reward after execution of subtask $M_i$
is not a component of the average-adjusted value function of subtask $M_i$, the algorithm fails
to find a hierarchically optimal average reward policy in general, as discussed in
(Seri and Tadepalli, 2002) and as we will demonstrate in the experiments of Section 4.3.

To achieve recursive optimality, the policy learned for each subtask must be context
free, that is, each subtask maximizes its local gain given the policies of its descendants. In the
RAR algorithm, although each subtask maximizes its local gain given the policies of its
descendants, the policy learned for each subtask is not necessarily context free, and as a
result the algorithm does not find a recursively optimal average reward policy in general.
The reason is that the local gain $g_i$ for each subtask $M_i$ does not depend only on the
policies of its descendants. The local gain $g_i$ is the gain of the SMDP defined by Equation
4.13 and therefore depends on the initial state distribution $\mathcal{I}_i(s)$. The initial state distribution
$\mathcal{I}_i(s)$ depends not only on the policies of $M_i$'s descendants, but also on the policies of
its parents, which makes the local gain $g_i$ context dependent. However, the algorithm finds
a recursively optimal average reward policy when the initial distribution invariance (IDI)
condition holds. In this case, the policy learned by this method at each subtask is independent
of the context in which it is executed and therefore can be reused by other hierarchies.

Definition 4.2 (Initial Distribution Invariance (IDI) Condition): The initial state distribution
for each non-primitive subtask in the hierarchy is independent of the policies of
its parents. □

In other words, the initial state distribution for each non-primitive subtask cannot be changed
by altering the policies of its parents. One special case that satisfies the IDI condition is
when each non-primitive subtask in the hierarchy has only one initiation state, that is, $|\mathcal{I}_i| = 1$ for
$i = 1, \ldots, m-1$, where $M_i$ is a non-primitive subtask.

4.3  Experimental  Results 

The goal of this section is to demonstrate the efficacy of the algorithms proposed in
Sections 4.1 and 4.2. We show the type of optimality that they converge to, as well as their
performance and speed compared to other algorithms. We conduct two sets of experiments
in this section. In Section 4.3.1, we apply five HRL algorithms to a simple discrete-time
AGV scheduling problem. The advantage of using this simple domain is that it clearly
demonstrates the difference between the optimality criteria achieved by these algorithms.
Then we turn to a more complex AGV scheduling task in Section 4.3.2. In that section, we
model an AGV scheduling task as discrete-time and continuous-time problems and apply the
HAR and RAR algorithms, as well as a flat average reward RL algorithm, to both models.

4.3.1  A  Small  AGV  Scheduling  Problem 

In this section, we apply the discrete-time hierarchically optimal average reward RL
(HAR) algorithm described in Section 4.1, the discrete-time recursively optimal average
reward RL (RAR) algorithm described in Section 4.2, and HH-Learning, the algorithm
proposed by Seri and Tadepalli (2002), to a small AGV scheduling task. We also test
MAXQ-Q, the recursively optimal discounted reward HRL algorithm proposed by Dietterich
(2000), and a hierarchically optimal discounted reward RL algorithm (HDR) on
this task. The HDR algorithm is an extension of MAXQ-Q using the three-part value
function decomposition proposed by Andre and Russell (2002) described in Chapter 3.
These experimental results clearly demonstrate the difference between the optimality criteria
achieved by these algorithms.

A small AGV domain is depicted in Figure 4.2. In this domain there are two machines
M1 and M2 that produce parts to be delivered to corresponding destination stations G1
and G2. Since machines and destination stations are in two different rooms, the AGV has
to pass through one of the two doors D1 and D2 every time it goes from one room to another.
Part 1 is more important than part 2; therefore the AGV gets a reward of 20 when part 1
is delivered to destination G1 and a reward of 1 when part 2 is delivered to destination
G2. The AGV receives a reward of -1 for all other actions. This task is deterministic and
the state variables are AGV location and AGV status (empty, carrying part 1, or carrying part 2),
for a total of $26 \times 3 = 78$ states. In all experiments, we use the task graph shown in
Figure 4.2 and set the discount factor to 0.99 for the discounted reward algorithms. We tried
several discount factors and 0.99 yielded the best performance. Using this task graph,
the hierarchically and recursively optimal policies are different. Since delivering part 1 yields more
reward than delivering part 2, the hierarchically optimal policy is one in which the AGV always serves
machine M1. In the recursively optimal policy, the AGV switches from serving machine
M1 to serving machine M2 and vice versa. In this policy, the AGV goes to machine M1,
picks up a part of type 1, goes to goal G1 via door D1, drops the part there, then passes
through door D2, goes to machine M2, picks up a part of type 2, goes to goal G2 via door
D2, and then switches again to machine M1, and so on.


[Figure 4.2 legend: M1: Machine 1; M2: Machine 2; D1: Door 1; D2: Door 2; G1: Goal 1; G2: Goal 2]


Figure  4.2.  A  small  AGV  scheduling  task  and  its  associated  task  graph. 


Among the algorithms we applied to this task, the hierarchically optimal average reward
RL (HAR) and the hierarchically optimal discounted reward RL (HDR) algorithms find the
hierarchically optimal policy, whereas the other algorithms learn only the recursively
optimal policy. Figure 4.3 shows the throughput of the system for these algorithms. In
this figure, the throughput of the system is the number of parts deposited at the
destination stations, weighted by their reward (20 x number of part 1 deliveries + 1 x
number of part 2 deliveries), in 10,000 time steps. Each experiment was conducted ten
times and the results were averaged.
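
As a small illustration of the throughput measure just defined, the following sketch
weights delivered parts by their rewards. The function name and the example counts are
hypothetical; only the weights 20 and 1 come from the text.

```python
def weighted_throughput(part1_delivered: int, part2_delivered: int) -> int:
    """Reward-weighted number of parts delivered in one measurement window."""
    reward_part1 = 20  # reward for delivering part 1 to G1 (from the text)
    reward_part2 = 1   # reward for delivering part 2 to G2 (from the text)
    return reward_part1 * part1_delivered + reward_part2 * part2_delivered


# Hypothetical delivery counts for one window of 10,000 time steps.
only_part1 = weighted_throughput(part1_delivered=300, part2_delivered=0)
alternating = weighted_throughput(part1_delivered=160, part2_delivered=160)
print(only_part1, alternating)
```

With counts like these, a policy that delivers fewer parts in total can still score higher
if more of them are part 1, which is why the measure separates the two optimality criteria.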

Figure 4.3. This plot shows that the HDR and HAR algorithms (the two curves at the top)
learn the hierarchically optimal policy, while RAR, MAXQ-Q, and HH-Learning (the three
curves at the bottom) find only the recursively optimal policy for the small AGV
scheduling task.

4.3.2 AGV Scheduling Problem (Discrete and Continuous Time Models)

In this section, we describe two sets of experiments on the AGV scheduling problem
shown in Figure 4.4. M1 to M3 are workstations in this environment. Parts of type i have
to be carried to the drop-off station at workstation i (D_i), and the assembled parts have
to be brought back from the pick-up stations of the workstations (the P_i's) to the
warehouse. AGV travel is unidirectional, as the arrows show. We model this AGV scheduling
task using both discrete-time and continuous-time models and demonstrate the performance
and speed of three HRL algorithms, hierarchically optimal average reward RL (HAR),
recursively optimal average reward RL (RAR), and hierarchically optimal discounted reward
RL (HDR), as well as a non-hierarchical average reward algorithm on this problem. In both
experiments, we use the task graph shown in Figure 4.5 for the AGV scheduling problem, and
discount factors 0.9 and 0.95 for the discounted reward algorithms. A discount factor of
0.95 yielded better performance in both experiments.

[Figure 4.4 legend: P: Pick-up Buffer; D: Drop-off Buffer; M: Machine]

Figure 4.4. An AGV scheduling task. An AGV agent (not shown) carries raw materials
and finished parts between machines and warehouse.

The state of the environment consists of the number of parts in the pick-up and drop-off
stations of each machine, and whether the warehouse contains parts of each of the three
types. In addition, the agent keeps track of its own location and status as part of its
state space. Thus, in the flat case, the state space consists of 33 locations, 6 buffers
of size 2, 7 possible states of the AGV (carrying Part1, ..., carrying Assembly1, ...,
empty), and 2 values for each part in the warehouse, i.e., 33 x 3^6 x 7 x 2^3 = 1,347,192
states. State abstraction helps reduce the state space considerably. Only the relevant
state variables are used when storing the value functions in each node of the task graph.
For example, for the Navigation subtask, only the location state variable is relevant, and
this subtask can be learned with only 33 values. For each of the high-level subtasks
DM1, ..., DM3, the number of relevant states is 33 x 7 x 3 x 2 = 1,386, and for each of
the high-level subtasks DA1, ..., DA3, the number of relevant states is 33 x 7 x 3 = 693.
This state abstraction gives us a compact way of representing the value functions and
speeds up the algorithm.
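
The state counts quoted above can be checked with a few lines of arithmetic. The sketch
below is only a sanity check. The variable names are illustrative, and reading the factors
3 and 2 in the DM subtasks as "levels of one buffer" and "part present or absent in the
warehouse" is an assumption rather than something the text states.

```python
# Sizes taken from the text; names are illustrative.
LOCATIONS = 33
BUFFERS = 6          # one pick-up and one drop-off buffer for each of 3 machines
BUFFER_LEVELS = 3    # a buffer of size 2 holds 0, 1, or 2 parts
AGV_STATUSES = 7     # carrying Part1..Part3, Assembly1..Assembly3, or empty
PART_TYPES = 3       # each type is either present or absent in the warehouse

flat_states = LOCATIONS * BUFFER_LEVELS ** BUFFERS * AGV_STATUSES * 2 ** PART_TYPES
navigation_states = LOCATIONS
dm_states = LOCATIONS * AGV_STATUSES * BUFFER_LEVELS * 2  # assumed factor meaning
da_states = LOCATIONS * AGV_STATUSES * BUFFER_LEVELS      # assumed factor meaning

print(flat_states, navigation_states, dm_states, da_states)
```

Under these readings, each high-level subtask stores values for roughly a thousandth of
the flat state space.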

The discrete-time experimental results were generated with the following model
parameters. The inter-arrival time for parts at the warehouse is uniformly distributed
with a mean of 12 time steps and a variance of 2 time steps. The percentages of Part1,
Part2, and Part3 in the part arrival process are 40, 35, and 25, respectively. The times
required for assembling the various parts are Poisson random variables with means of 6,
10, and 12 time steps for Part1, Part2, and Part3, respectively, and a variance of 2 time
steps. Table 4.1 shows the parameters of the discrete-time model.


Parameter                      Distribution   Mean (steps)   Variance (steps)
Assembly Time for Part1        Poisson        6              2
Assembly Time for Part2        Poisson        10             2
Assembly Time for Part3        Poisson        12             2
Inter-Arrival Time for Parts   Uniform        12             2

Table  4.1.  Parameters  of  the  Discrete-Time  Model 


The continuous-time experimental results were generated with the following model
parameters. The time required for execution of each primitive action is a normal random
variable with a mean of 10 seconds and a variance of 2 seconds. The inter-arrival time for
parts at the warehouse is uniformly distributed with a mean of 100 seconds and a variance
of 20 seconds. The percentages of Part1, Part2, and Part3 in the part arrival process are
40, 35, and 25, respectively. The times required for assembling the various parts are
normal random variables with means of 100, 120, and 180 seconds for Part1, Part2, and
Part3, respectively, and a variance of 20 seconds. Table 4.2 contains the parameters of
the continuous-time model. In both cases, each experiment was conducted five times and the
results were averaged.


Parameter                               Distribution   Mean (sec)   Variance (sec)
Execution Time for Primitive Actions    Normal         10           2
Assembly Time for Part1                 Normal         100          20
Assembly Time for Part2                 Normal         120          20
Assembly Time for Part3                 Normal         180          20
Inter-Arrival Time for Parts            Uniform        100          20

Table  4.2.  Parameters  of  the  Continuous-Time  Model 
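
One way to draw the continuous-time quantities of Table 4.2 in a simulator is sketched
below. This is not the simulator used in the experiments. It assumes that the "variance"
column is a variance in the statistical sense, and it clips normal draws at zero so that
durations stay non-negative; both choices, and all function names, are assumptions.

```python
import math
import random


def sample_uniform(mean: float, variance: float, rng: random.Random) -> float:
    """Draw from a uniform distribution with the given mean and variance."""
    half_width = math.sqrt(3.0 * variance)  # Var of U(a, b) is (b - a)^2 / 12
    return rng.uniform(mean - half_width, mean + half_width)


def sample_normal(mean: float, variance: float, rng: random.Random) -> float:
    """Draw from a normal distribution, clipped at zero (an added safeguard)."""
    return max(0.0, rng.gauss(mean, math.sqrt(variance)))


rng = random.Random(0)
action_time = sample_normal(10.0, 2.0, rng)       # primitive action, seconds
inter_arrival = sample_uniform(100.0, 20.0, rng)  # gap between part arrivals, seconds
assembly_times = [sample_normal(m, 20.0, rng) for m in (100.0, 120.0, 180.0)]
part_type = rng.choices([1, 2, 3], weights=[40, 35, 25])[0]  # 40/35/25 percent mix
```

Passing an explicit `random.Random` instance keeps repeated runs reproducible, which
matters when results are averaged over several experiments.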


Figure 4.6 compares the discrete-time hierarchically optimal average reward RL (HAR)
algorithm described in Section 4.1 with the discrete-time hierarchically optimal
discounted reward RL (HDR) algorithm and the discrete-time recursively optimal average
reward RL (RAR) algorithm described in Section 4.2. The graph shows the improved
performance of the HAR algorithm. This figure also shows that the HAR algorithm reaches
the same throughput as the non-hierarchical average reward algorithm but converges
faster. The non-hierarchical average reward algorithm used in this experiment is relative
value iteration (RVI) Q-learning (Abounadi et al., 2001). The difference in convergence
speed between flat and hierarchical algorithms becomes more significant as we increase the
number of states.
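
For readers unfamiliar with the flat baseline, a minimal sketch of one tabular RVI
Q-learning step is given below. The text does not describe how the baseline was
implemented, so this is only an illustration: the function name, the dictionary-based
table, and the choice of a fixed reference state-action pair as the offset f(Q) are all
assumptions.

```python
from collections import defaultdict


def rvi_q_update(q, state, action, reward, next_state, actions, alpha, ref):
    """One RVI Q-learning step; `ref` is a fixed reference (state, action) pair."""
    best_next = max(q[(next_state, b)] for b in actions)
    gain_estimate = q[ref]  # f(Q): here, the value of the reference pair
    td_error = reward - gain_estimate + best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


# Tiny hypothetical example with two states and two actions.
q = defaultdict(float)
q[("s1", "go")] = 2.0
delta = rvi_q_update(q, "s0", "go", reward=1.0, next_state="s1",
                     actions=["go", "stay"], alpha=0.5, ref=("s0", "stay"))
print(delta, q[("s0", "go")])
```

Subtracting the reference value instead of discounting keeps the table bounded in a
continuing task, so no discount factor has to be tuned.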

Figure 4.6. This plot shows that the discrete-time HAR algorithm performs better than
the discounted reward HDR algorithm and the RAR algorithm on the AGV scheduling task. It
also demonstrates the faster convergence of the HAR algorithm compared with RVI
Q-learning, the non-hierarchical average reward algorithm.

Figure 4.7 compares the continuous-time hierarchically optimal average reward RL
(HAR) algorithm described in Section 4.1 with the continuous-time hierarchically optimal
discounted reward RL (HDR) algorithm and the continuous-time recursively optimal average
reward RL (RAR) algorithm described in Section 4.2. The graph shows that the HAR algorithm
converges to the same performance as the discounted reward HDR algorithm, and that both
perform better than the RAR algorithm. This figure also shows that the HAR algorithm
reaches the same throughput as the non-hierarchical average reward algorithm but converges
faster. The non-hierarchical average reward algorithm used in this experiment is a
continuous-time version of relative value iteration (RVI) Q-learning (Abounadi et al.,
2001). The difference in convergence speed between flat and hierarchical algorithms
becomes more significant as we increase the number of states.

These results are consistent with the hypothesis that the average reward framework
is superior to the discounted framework for learning continuing tasks, such as queuing,
scheduling, and flexible manufacturing. Moreover, average reward methods do not need
careful tuning of the discount factor to find gain-optimal policies.

Figure 4.7. This plot shows that the continuous-time HAR algorithm converges to the same
performance as the discounted reward HDR algorithm, and that both outperform the
recursively optimal average reward (RAR) algorithm on the AGV scheduling task. It also
demonstrates the faster convergence of the HAR algorithm compared with RVI Q-learning,
the flat average reward algorithm.

4.4 Summary and Future Work

This chapter presents new discrete-time and continuous-time hierarchically optimal
average reward RL (HAR) and recursively optimal average reward RL (RAR) algorithms
applicable to continuing tasks, including manufacturing, scheduling, queuing, and
inventory control. These algorithms are based on the average reward SMDP model, which has
been shown to be more appropriate for a wide class of continuing tasks than the better
studied discounted reward SMDP model. Hierarchically optimal average reward RL (HAR)
algorithms aim to find a hierarchical policy, within the space of policies defined by the
hierarchical decomposition, that maximizes the global gain. In the recursively optimal
average reward RL setting, the formulation of learning algorithms directly depends on the
local optimality criterion used for each subtask in the hierarchy. The recursively optimal
average reward RL (RAR) algorithms proposed in this chapter treat subtasks as continuing
average reward problems and solve them by maximizing their gain given the policies of
their children. We investigate the conditions under which the policy learned by the RAR
algorithm at each subtask is independent of the context in which it is executed and can
therefore be reused by other hierarchies. The effectiveness of the proposed algorithms was
tested using two AGV scheduling tasks.

There are a number of directions for future work. An immediate open question is how to
prove the asymptotic convergence of the algorithms to hierarchically optimal policies.
Such results would give the proposed algorithms theoretical support in addition to the
empirical effectiveness demonstrated in this chapter. Studying other local optimality
criteria for subtasks in the hierarchy is another interesting problem; it may help develop
more effective recursively optimal average reward RL algorithms. Many other manufacturing
and robotics problems may also benefit from the algorithms proposed in this chapter.

CHAPTER 5

HIERARCHICAL POLICY GRADIENT REINFORCEMENT LEARNING


We illustrated value function (VF) and policy gradient (PG) solutions for MDPs in
Section 2.2.4. As described in that section, there are only weak theoretical guarantees on
the performance of value function reinforcement learning (VFRL) methods on problems
with large discrete or continuous state spaces. We also mentioned that policy gradient
reinforcement learning (PGRL) algorithms have recently received attention as a means of
solving problems with continuous state spaces. They have also shown better performance
when states are hidden. However, they are usually slower than VFRL methods. A possible
solution is to incorporate prior knowledge and decompose the high-dimensional task into
a collection of modules with smaller state spaces, and to learn these modules in a way
that solves the overall problem. Hierarchical VFRL methods (Parr, 1998; Sutton et al.,
1999; Dietterich, 2000; Andre and Russell, 2001) have been developed using this approach,
as an attempt to scale RL to large state spaces.

In this chapter,¹ we propose a family of hierarchical policy gradient reinforcement
learning (HPGRL) algorithms for scaling PGRL methods to problems with continuous (or
large discrete) state and/or action spaces. In HPGRL, non-primitive subtasks are defined
as PGRL problems. Later in this chapter, we accelerate learning in HPGRL algorithms
by formulating high-level subtasks, which usually involve smaller state and finite action
spaces, as VFRL problems, and low-level subtasks with infinite state and/or action spaces
as PGRL problems. This idea is similar to the idea used by Morimoto and Doya (2001) to
learn stand-up behavior in a three-link, two-joint robot. We call this family of
algorithms hierarchical hybrid algorithms.

¹ Most of the work presented in this chapter first appeared in Ghavamzadeh and Mahadevan
(2003), "Hierarchical policy gradient algorithms," Proceedings of the Twentieth
International Conference on Machine Learning, pp. 226-233.

The rest of this chapter is organized as follows. In Section 5.1, we describe how we
define each subtask in a hierarchy as a PGRL problem. In Section 5.2, we introduce a
family of HPGRL algorithms and compare the performance of this family of algorithms
with a hierarchical VFRL algorithm and a flat RL algorithm in a simple taxi-fuel
problem. In Section 5.3, we propose a family of hierarchical hybrid algorithms to
accelerate learning in HPGRL algorithms. We illustrate this family of algorithms and
demonstrate its performance using a continuous state and action ship steering problem.
Finally, Section 5.4 summarizes the chapter and discusses some directions for future work.

5.1  Policy  Gradient  Formulation 

In this section, we demonstrate how to define a subtask in a hierarchical task
decomposition as a PGRL problem. We formulate a subtask in terms of a parameterized family
of policies and a performance function. We then define a method to estimate the gradient
of the performance function and a routine to update the policy parameters using this
gradient. Our focus in this chapter is on episodic problems, so we assume that the overall
task (the root of the hierarchy) is episodic.

5.1.1  Policy  Formulation 

Each subtask $M_i$ is defined using a set of randomized stationary policies
$\mu_i(\theta_i)$ parameterized in terms of a parameter vector $\theta_i \in \mathbb{R}^K$.
The term $\mu_i(a \mid s; \theta_i)$ denotes the probability of taking action $a$ in state
$s$ under the policy corresponding to $\theta_i$. These parameterized policies for
individual subtasks define a set of parameterized hierarchical policies $\mu(\theta)$,
where $\theta$ is the vector of all subtasks' parameters. For every subtask $M_i$ in the
hierarchy, we make the following assumption about its set of parameterized policies
$\mu_i(\theta_i)$.

Assumption 5.1: For every state $s \in S_i$ and every action $a \in A_i$,
$\mu_i(a \mid s; \theta_i)$, as a function of $\theta_i$, is bounded and has bounded first
and second derivatives. Furthermore,
$\nabla \mu_i(a \mid s; \theta_i) / \mu_i(a \mid s; \theta_i)$ is bounded, differentiable,
and has bounded first derivatives. □

In HRL methods, we typically assume that every time a subtask $M_i$ is called, it starts
at one of its initial states ($\in I_i$) and terminates at one of its terminal states
($\in T_i$) after a finite number of steps. Therefore, we make the following assumption
for every subtask $M_i$ in the hierarchy. Under this assumption, each subtask can be
considered an episodic problem and each instantiation of a subtask can be considered an
episode.

Assumption 5.2 (Subtask Termination): We define a dummy state $s^* \in S_i$ such that,
for every action $a \in A_i$ and every terminal state $s_{T_i}$, we have

$$r_i(s_{T_i}, a) = 0 \quad \text{and} \quad P_i(s^*, 1 \mid s_{T_i}, a) = 1,$$

$$r_i(s^*, a) = 0 \quad \text{and} \quad P_i(s^*, 1 \mid s^*, a) = 1,$$

and for all hierarchical stationary policies $\mu(\theta)$ and non-terminal states
$s \in S_i$, we have

$$F_i^{\mu(\theta)}(s^*, 1 \mid s) = 0,$$

and finally for all states $s \in S_i$, we have

$$F_i^{\mu(\theta)}(s^*, N \mid s) > 0,$$

where $F_i^{\mu(\theta)}$ is the multi-step abstract transition probability function of
subtask $M_i$ under the hierarchical policy $\mu(\theta)$ described in Section 3.2, and
$N = |S_i|$ is the number of states in the state space of subtask $M_i$. □


Under this assumption, all terminal states of subtask $M_i$ transition with probability 1
and reward 0 to the dummy state $s^*$ and stay there until the next instantiation of
subtask $M_i$, as shown in Figure 5.1. This is a dummy transition and does not add another
time step to the cycle of subtask $M_i$.

[Figure 5.1 diagram labels: set of initial states $I_i$; set of terminal states $T_i$; transition with r = 0, p = 1]


Figure  5.1.  This  figure  shows  how  we  model  a  subtask  as  an  episodic  problem  under 
Assumption  5.2. 


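Under this model, for every hierarchical policy $\mu(\theta)$, we define a new MDP
$M_{I_i}$ for each subtask $M_i$ with abstract transition probabilities and rewards

$$F_{I_i}^{\mu(\theta)}(s', 1 \mid s) =
\begin{cases}
F_i^{\mu(\theta)}(s', 1 \mid s) & s \neq s^* \\
I_i(s') & s = s^*
\end{cases}
\qquad
r_{I_i}(s, a; \theta) = r_i(s, a; \theta)
\tag{5.1}$$

where $I_i(s)$ is the probability that subtask $M_i$ starts at state $s$.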

Let $\mathcal{F}_{I_i}$ be the set of all abstract transition probability functions
$F_{I_i}^{\mu(\theta)}$. We have the following result for subtask $M_i$.

Lemma 5.1: Let Assumptions 5.1 and 5.2 hold. Then for every $F \in \mathcal{F}_{I_i}$ and
every state $s \in S_i$, we have $\sum_{n=1}^{N} F(s^*, n \mid s) > 0$. □

Lemma 5.1 is equivalent to assuming that the MDP $M_{I_i}$ is recurrent, i.e., the
underlying Markov chain for every policy $\mu(\theta)$ in this MDP has a single recurrent
class and the state $s^*$ is a recurrent state. In this case, the balance equations

$$\sum_{s=1}^{|S_i|} F_{I_i}^{\mu(\theta)}(s', 1 \mid s)\, \pi_{I_i}^{\mu(\theta)}(s)
= \pi_{I_i}^{\mu(\theta)}(s'), \quad \forall s' \in S_i,\ s' \neq s^*,$$

$$\sum_{s=1}^{|S_i|} \pi_{I_i}^{\mu(\theta)}(s) = 1$$

have a unique solution $\pi_{I_i}^{\mu(\theta)}$. We refer to $\pi_{I_i}^{\mu(\theta)}$ as
the steady state probability vector of the Markov chain with transition probabilities
defined by Equation 5.1, and to $\pi_{I_i}^{\mu(\theta)}(s)$ as the steady state
probability of being in state $s$.
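
To make the steady state probabilities concrete, the sketch below approximates them for a
small recurrent chain by repeatedly applying the transition matrix. The three-state chain,
the function name, and the use of power iteration are all assumptions chosen for
illustration. For simplicity, the move into the dummy state is counted as an ordinary
step here, unlike the dummy transition in the model above.

```python
def stationary_distribution(P, iterations=2000):
    """Approximate the steady state vector pi with pi = pi P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[s] * P[s][t] for s in range(n)) for t in range(n)]
    return pi


# Hypothetical 3-state subtask; index 0 plays the role of the dummy state s*.
P = [
    [0.0, 1.0, 0.0],  # s* restarts the subtask in state 1 (initial distribution)
    [0.0, 0.5, 0.5],  # state 1 either stays put or moves to terminal state 2
    [1.0, 0.0, 0.0],  # terminal state 2 moves to s* with probability 1
]
pi = stationary_distribution(P)
mean_recurrence_time = 1.0 / pi[0]  # expected number of steps between visits to s*
print(pi, mean_recurrence_time)
```

The reciprocal of the steady state probability of the recurrent state gives the mean
recurrence time, the quantity that scales the gradient expression derived in the next
section.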

5.1.2  Performance  Measure  Definition  and  Optimization 

We define the weighted reward-to-go, $\chi_i(\theta)$, as the performance measure of
subtask $M_i$ under the parameterized hierarchical policy $\mu(\theta)$, for which
Assumption 5.2 holds, as

$$\chi_i(\theta) = \sum_{s \in S_i} I_i(s)\, J_i(s; \theta).$$

The term $J_i(s; \theta)$ is the reward-to-go of subtask $M_i$ in state $s$ under the
hierarchical policy $\mu(\theta)$ and is defined as

$$J_i(s; \theta) = E\left[ \sum_{k=0}^{T-1} r_i(s_k, a_k) \,\middle|\, s_0 = s; \theta \right],$$

where $T = \min\{k > 0 \mid s_k = s^*\}$ is the first future time that state $s^*$ is
visited.²

² With the definition of the absorbing state $s^*$ in our model (see Figure 5.1), the
reward-to-go of subtask $M_i$ in state $s$, $J_i(s; \theta)$, is the same as the
undiscounted projected value function of subtask $M_i$ in state $s$.

In order to obtain an expression for the gradient $\nabla \chi_i(\theta)$, we use the MDP
$M_{I_i}$ defined in Section 5.1.1. By Lemma 5.1, the MDP $M_{I_i}$ is recurrent. For the
MDP $M_{I_i}$, let $\pi_{I_i}^{\mu(\theta)}(s)$ be the steady state probability of being
in state $s$ at subtask $M_i$, and let $E_{I_i}[T \mid \theta]$ be the mean recurrence
time of subtask $M_i$, i.e., $E_{I_i}[T \mid \theta] = E_{I_i}[T \mid s_0 = s^*; \theta]$,
under the hierarchical policy $\mu(\theta)$. We also define $J_i(s, a; \theta)$³ as

$$J_i(s, a; \theta) = E\left[ \sum_{k=0}^{T-1} r_i(s_k, a_k) \,\middle|\, s_0 = s,\ a_0 = a; \theta \right].$$

Using the recurrent MDP $M_{I_i}$, we can derive the following proposition, which gives an
expression for the gradient of the weighted reward-to-go $\chi_i(\theta)$ with respect to
the parameter vector $\theta$.

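Proposition 5.1: If Assumptions 5.1 and 5.2 hold, then

$$\nabla \chi_i(\theta) = E_{I_i}[T \mid \theta] \sum_{s \in S_i} \sum_{a \in A_i}
\pi_{I_i}^{\mu(\theta)}(s)\, \nabla \mu_i(a \mid s; \theta_i)\, J_i(s, a; \theta).$$

This proposition is similar to Proposition 1 on page 35 of Marbach (1998).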

The expression for the gradient in Proposition 5.1 can be estimated over a renewal
cycle (the cycle between consecutive visits to the recurrent state $s^*$) as

$$F_{m,i}(\theta) = \sum_{n=t_m}^{t_{m+1}-1} R_i(s_n, a_n; \theta)\,
\frac{\nabla \mu_i(a_n \mid s_n; \theta_i)}{\mu_i(a_n \mid s_n; \theta_i)}
\tag{5.2}$$

where $t_m$ is the time of the $m$th visit to the recurrent state $s^*$ and
$R_i(s_n, a_n; \theta) = \sum_{k=n}^{t_{m+1}-1} r_i(s_k, a_k; \theta)$ is an estimate of
$J_i(s_n, a_n; \theta)$.

³ With the definition of the absorbing state in Figure 5.1, $J_i(s, a; \theta)$ is the
undiscounted projected action-value function of subtask $M_i$.
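
The reward sums used in Equation 5.2 are simply the rewards collected from each step to
the end of the renewal cycle. The following sketch computes them for one recorded cycle
with a single backward pass. The function name and the example rewards are hypothetical.

```python
def rewards_to_go(rewards):
    """Return R[n] = sum(rewards[n:]), the reward collected from step n onward."""
    out = [0.0] * len(rewards)
    running = 0.0
    for n in range(len(rewards) - 1, -1, -1):
        running += rewards[n]
        out[n] = running
    return out


# Hypothetical cycle: two steps costing 1 each, then a final reward of 20.
print(rewards_to_go([-1.0, -1.0, 20.0]))  # [18.0, 19.0, 20.0]
```

Accumulating from the end of the cycle takes one pass over the rewards, whereas summing
each tail separately would take time quadratic in the cycle length.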
From Equation 5.2, we obtain the following procedure to update $\theta_i$, the parameter
vector of subtask $M_i$, along the approximate gradient direction at every time step.

$$z_{k+1,i} =
\begin{cases}
0 & s_k = s^*, \\
z_{k,i} + \dfrac{\nabla \mu_i(a_k \mid s_k; \theta_{k,i})}{\mu_i(a_k \mid s_k; \theta_{k,i})} & \text{otherwise,}
\end{cases}$$

$$\theta_{k+1,i} = \theta_{k,i} + \alpha_{k,i}\, r_i(s_k, a_k)\, z_{k+1,i}
\tag{5.3}$$
where $\alpha_{k,i}$ is the step size parameter for subtask $M_i$ and satisfies the
following assumptions.

Assumption 5.3: The $\alpha_{k,i}$'s are deterministic, nonnegative, and satisfy
$\sum_{k=1}^{\infty} \alpha_{k,i} = \infty$ and
$\sum_{k=1}^{\infty} \alpha_{k,i}^2 < \infty$. □

Assumption 5.4: The $\alpha_{k,i}$'s are non-increasing and there exist a positive
integer $p$ and a positive scalar $A$ such that
$\sum_{k=n}^{n+t} (\alpha_{n,i} - \alpha_{k,i}) \le A t^p \alpha_{n,i}^2$ for all positive
integers $n$ and $t$. □

We have the following convergence result for the iterative procedure in Equation 5.3 to
update the parameters.

Proposition 5.2: Let Assumptions 5.1, 5.2, 5.3, and 5.4 hold, and let $\theta_k$ be the
sequence of parameter vectors generated by Equation 5.3. Then, the estimate of the
performance measure $\chi_i(\theta_k)$ converges and
$\lim_{k \to \infty} \nabla \chi_i(\theta_k) = 0$ with probability 1. □

This proposition is similar to Proposition 14 on page 59 of Marbach (1998).

Equation 5.3 provides an unbiased estimate of $\nabla \chi_i(\theta)$. For systems
involving a large state space, the interval between visits to state $s^*$ can be large. As
a consequence, the estimate of $\nabla \chi_i(\theta)$ might have a large variance.
Several methods have been proposed to reduce the variance of this estimate and yield
faster convergence (Marbach, 1998; Baxter and Bartlett, 2001). For instance, we can use a
discount factor $\gamma$ in the reward-to-go estimation. However, these methods introduce
a bias into the estimate of $\nabla \chi_i(\theta)$. For these methods, we can derive a
modified version of Equation 5.3 to incrementally update the parameter vector along the
approximate gradient direction.
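
A minimal sketch of the incremental update of Equation 5.3 for a single subtask is shown
below. It assumes a tabular softmax policy, for which the ratio of the policy gradient to
the policy is the gradient of the log policy. The softmax form, the dictionary-based
trace, and all function names are assumptions made for illustration, not the
parameterization used in the experiments.

```python
import math


def softmax_policy(theta, state):
    """Action probabilities mu(a | s; theta) for a tabular softmax policy."""
    prefs = theta[state]
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def likelihood_ratio(theta, state, action):
    """grad mu / mu (= grad log mu), keyed by (state, action index)."""
    probs = softmax_policy(theta, state)
    return {(state, b): (1.0 if b == action else 0.0) - probs[b]
            for b in range(len(probs))}


def update(theta, z, state, action, reward, alpha, at_dummy_state):
    """One step of the trace-based update; returns the new trace z."""
    if at_dummy_state:
        z = {}  # reset the trace when the recurrent state s* is visited
    else:
        for key, g in likelihood_ratio(theta, state, action).items():
            z[key] = z.get(key, 0.0) + g
    for (s, b), trace in z.items():
        theta[s][b] += alpha * reward * trace
    return z


theta = {0: [0.0, 0.0]}  # one state with two actions, equal preferences
z = update(theta, {}, state=0, action=1, reward=1.0, alpha=0.1, at_dummy_state=False)
print(theta[0], softmax_policy(theta, 0))
```

After a positive reward, the preference for the action just taken rises and the other
falls, so the policy shifts toward that action. The trace carries credit to earlier
choices in the same cycle.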


5.2  Hierarchical  Policy  Gradient  Algorithms 

After decomposing the overall task into a set of subtasks as described in Chapter 3, and
formulating each subtask in the hierarchy as an episodic PGRL problem as illustrated in
Section 5.1, we can use the update Equation 5.3 and derive an HPGRL algorithm to maximize
the weighted reward-to-go for every subtask in the hierarchy. Algorithm 3 shows the
pseudocode for this algorithm.


Algorithm 3 A hierarchical policy gradient algorithm that maximizes the weighted
reward-to-go for the subtasks in the hierarchy.

 1: Function HPGRL(Task $M_i$, State $s$)
 2: $RR = 0$
 3: if $M_i$ is a primitive action then
 4:     execute action $i$ in state $s$, observe state $s'$ and reward $r(s, i)$
 5:     return $r(s, i)$
 6: else /* $M_i$ is a non-primitive subtask */
 7:     while $M_i$ has not terminated ($s \neq s^*$) do
 8:         choose action $a$ using policy $\mu_i(s; \theta_i)$
 9:         $R$ = HPGRL(Task $M_a$, State $s$)
10:         observe result state $s'$ and internal reward $\tilde{r}_i(s, a)$
11:         if $s' = s^*$ then
12:             $z_{k+1,i} = 0$
13:         else
14:             $z_{k+1,i} = z_{k,i} + \nabla \mu_i(a \mid s; \theta_{k,i}) / \mu_i(a \mid s; \theta_{k,i})$
15:         end if
16:         $\theta_{k+1,i} = \theta_{k,i} + \alpha_{k,i} [R + \tilde{r}_i(s, a)] z_{k+1,i}$
17:         $RR = RR + R$
18:         $s = s'$
19:     end while
20: end if
21: return $RR$
22: end HPGRL

The term $\tilde{r}_i(s, a)$ on Lines 10 and 16 of the algorithm is the internal reward,
which can be used only inside each subtask to speed up its local learning and does not
propagate to the upper levels of the hierarchy. Lines 11 to 16 can be replaced with any
other policy gradient algorithm that optimizes the weighted reward-to-go, such as those
presented in Marbach (1998) or Baxter and Bartlett (2001). Thus, Algorithm 3 describes a
family of HPGRL algorithms that maximize the weighted reward-to-go for every subtask in
the hierarchy.

The above formulation of each subtask imposes the following limitations on the learned
policy: 1) The parameterized representation of a policy limits the policy search to a set
that is typically smaller than the set of all possible policies. 2) Gradient-based policy
search methods find a solution that is locally, rather than globally, optimal. Thus, in
general, the family of algorithms described above converges to a recursively local optimal
policy. If the policy learned for every subtask in the hierarchy coincides with the best
policy for that subtask, then these algorithms converge to a recursively optimal policy.

5.2.1  Taxi-Fuel  Problem 

In this section, we apply the HPGRL algorithm to the taxi-fuel problem introduced in
Dietterich (1998), and compare its performance with MAXQ-Q, a value function hierarchical
RL algorithm (Dietterich, 2000), and flat Q-learning.

A 5-by-5 grid world inhabited by a taxi is shown in Figure 5.2. There are four stations
marked as B(lue), G(reen), R(ed), and Y(ellow). The task is episodic. In each episode, the
taxi starts in a randomly chosen location with a randomly chosen amount of fuel ranging
from 5 to 12 units. There is a passenger at one of the four stations (chosen randomly),
and that passenger wishes to be transported to one of the other three stations (also
chosen randomly). The taxi must go to the passenger's location, pick up the passenger, go
to the destination location, and drop off the passenger there. The episode ends when the
passenger is deposited at the destination station or the taxi runs out of fuel. There are
8,750 possible states and 7 primitive actions in the domain: Pickup, Dropoff, Fillup, and
four navigation actions (each of which consumes one unit of fuel). Each action is
deterministic. There is a reward of -1 for each action and an additional reward of 20 for
successfully delivering the passenger. There is a reward of -10 if the taxi attempts to
execute the Dropoff or Pickup actions illegally, and a reward of -20 if the fuel level
falls below zero. System performance is measured in terms of the average reward per step;
in this task, maximizing it is equivalent to maximizing the total reward per episode. Each
experiment was conducted ten times and the results were averaged.
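
The reward structure described above can be written as a small function. The sketch below
is one possible reading and not the implementation used in the experiments. It assumes
that the penalties of -10 and -20 replace the usual -1 step cost rather than adding to it,
and the function and argument names are hypothetical.

```python
def taxi_reward(delivered: bool, illegal_pickup_or_dropoff: bool, fuel_after: int) -> int:
    """Reward for one primitive action in the taxi-fuel task (assumed combination rule)."""
    if fuel_after < 0:
        return -20          # the fuel level fell below zero
    if illegal_pickup_or_dropoff:
        return -10          # illegal Pickup or Dropoff attempt
    reward = -1             # cost of every action
    if delivered:
        reward += 20        # additional reward for a successful delivery
    return reward


print(taxi_reward(delivered=True, illegal_pickup_or_dropoff=False, fuel_after=3))
print(taxi_reward(delivered=False, illegal_pickup_or_dropoff=True, fuel_after=3))
```

If the penalties were instead meant to be added to the step cost, only the two early
returns would change.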


[Figure 5.2 legend: T: Taxi; B: Blue Station; G: Green Station; R: Red Station; Y: Yellow Station; F: Gas Station. Grid columns are labeled 0 to 4.]

Figure  5.2.  The  taxi-fuel  problem. 


Figure 5.3 compares the performance of the HPGRL, MAXQ-Q, and flat Q-learning algorithms
on the taxi-fuel problem.⁴ The hierarchical policy gradient algorithm used in this
experiment is the one shown in Algorithm 3, with one policy parameter for each
state-action pair (s, a). The graph shows that MAXQ-Q converges faster than HPGRL and flat
Q-learning, and that HPGRL is slightly faster than flat Q-learning.

As we expected, the HPGRL algorithm converges to the same performance as MAXQ-Q.
However, it is much slower than its value function based counterpart. The performance
of HPGRL can be improved by better policy formulation and by using more sophisticated
policy gradient algorithms for each subtask. The slow convergence of HPGRL algorithms
motivates us to use both VFRL and PGRL methods in a hierarchy. We address this by
introducing hierarchical hybrid algorithms in the next section.

⁴ Both HPGRL and MAXQ-Q utilize the hierarchical task decomposition used in Dietterich (1998).

Figure 5.3. This figure compares the performance of the HPGRL algorithm proposed in
this section with the MAXQ-Q and flat Q-learning algorithms on the taxi-fuel problem.

5.3  Hierarchical  Hybrid  Algorithms 

Despite the methods proposed to reduce the variance of gradient estimators in PGRL
algorithms, these algorithms are still slower than VFRL methods, as shown in the simple
taxi-fuel experiment in Section 5.2.1. We accelerate learning in HPGRL algorithms by
formulating subtasks with smaller state spaces and finite action spaces, usually located
at the high levels of the hierarchy, as VFRL problems, and subtasks with large state spaces
and/or infinite action spaces, usually located at the low levels of the hierarchy, as PGRL
problems. This formulation benefits at the same time from the faster convergence of VFRL
methods and from the power of PGRL algorithms in domains with infinite state and/or action
spaces. We call this family of algorithms hierarchical hybrid algorithms and illustrate
them using a ship steering task.

Figure 5.4 shows a ship steering task (Miller et al., 1990). A ship starts at a randomly
chosen position, orientation, and turning rate. Its goal is to be maneuvered at a constant
speed through a gate placed at a fixed position. The ship does not know the location of the
gate and observes the gate only when it passes through it.

[Figure 5.4 axes: x and y, each ranging from 0 to 1 km.]

Figure  5.4.  The  ship  steering  task. 


Equation 5.4 gives the motion equations of the ship, where $T = 5$ is the time constant
of convergence to the desired turning rate, $V = 3$ m/sec is the constant speed of the ship,
and $\Delta = 0.2$ sec is the sampling interval. There is a time lag between changes in the
desired turning rate and the actual turning rate, modeling the effects of a real ship's
inertia and the resistance of the water.

\begin{equation}
\begin{aligned}
x[t+1] &= x[t] + \Delta V \sin\theta[t] \\
y[t+1] &= y[t] + \Delta V \cos\theta[t] \\
\theta[t+1] &= \theta[t] + \Delta\,\dot{\theta}[t] \\
\dot{\theta}[t+1] &= \dot{\theta}[t] + \Delta\,(r[t] - \dot{\theta}[t])/T
\end{aligned}
\tag{5.4}
\end{equation}

         Variable         Range
State    x                0 to 1000 meters
         y                0 to 1000 meters
         $\theta$         -180 to 180 degrees
         $\dot{\theta}$   -15 to 15 degrees/sec
Action   r                -15 to 15 degrees/sec

Table 5.1. State and action variables for the ship steering task.

At each time t, the state of the ship is given by its position x[t] and y[t], orientation
$\theta[t]$, and actual turning rate $\dot{\theta}[t]$. The action is the desired turning
rate of the ship r[t]. All four state variables and the action are continuous; their ranges
are shown in Table 5.1. The ship steering problem is episodic. In each episode, the goal is
to learn to generate sequences of actions that steer the center of the ship through the gate
in the minimum amount of time. The sides of the gate are placed at coordinates (350,400) and
(450,400). If the ship moves out of bounds (x < 0 or x > 1000 or y < 0 or y > 1000), the
episode terminates and is considered a failure.
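
As an illustration, one sampling-interval update of Equation 5.4 together with the out-of-bounds test might be simulated as follows. This is a minimal sketch, not the simulator used in the experiments: the function names are hypothetical, and clipping the commanded turning rate to the range in Table 5.1 and wrapping the orientation into [-180, 180) degrees are assumptions.

```python
import math

DELTA = 0.2   # sampling interval (sec)
V = 3.0       # constant speed of the ship (m/sec)
T = 5.0       # time constant of convergence to the desired turning rate


def ship_step(x, y, theta_deg, theta_dot, r):
    """Apply Equation 5.4 once; angles in degrees, rates in degrees/sec."""
    r = max(-15.0, min(15.0, r))  # assumption: clip action to Table 5.1 range
    theta_rad = math.radians(theta_deg)
    # All right-hand sides use the values at time t.
    x_next = x + DELTA * V * math.sin(theta_rad)
    y_next = y + DELTA * V * math.cos(theta_rad)
    theta_next = theta_deg + DELTA * theta_dot
    theta_dot_next = theta_dot + DELTA * (r - theta_dot) / T
    # Assumption: keep the orientation inside [-180, 180) degrees.
    theta_next = (theta_next + 180.0) % 360.0 - 180.0
    return x_next, y_next, theta_next, theta_dot_next


def out_of_bounds(x, y):
    """Failure condition of the full task (1000 m by 1000 m area)."""
    return x < 0 or x > 1000 or y < 0 or y > 1000
```

With a speed of 3 m/sec and a 0.2 sec interval, the ship moves only 0.6 m per step, which makes concrete why the state changes so little between consecutive decisions.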

We applied both a flat PGRL algorithm and an actor-critic algorithm (Konda, 2002) to
this task without achieving good performance in a reasonable amount of time. Figure 5.7
shows that after learning for 50,000 episodes, these algorithms are able to steer the ship
successfully through the gate only 60 percent of the time. We believe this occurred for two
reasons, which make this problem hard to learn. First, since the ship cannot turn faster
than 15 degrees/sec, all state variables change only by a small amount at each control
interval. Thus, we need a high resolution discretization of the state space in order to
accurately model state transitions, which requires a large number of parameters for the
function approximator and makes the problem intractable. Second, there is a time lag
between changes in the desired turning rate r and the resulting changes in the actual
turning rate $\dot{\theta}$, the ship's position x, y, and its orientation $\theta$, which
requires the controller to deal with long delays.

However, we successfully applied a flat policy gradient algorithm to the simplified
versions of this problem shown in Figure 5.5, in which x and y range from 0 to 150 instead
of 0 to 1000, the ship always starts at a fixed position (initial positions in Figure 5.5)
with randomly chosen orientation and turning rate, and the goal is to reach a neighborhood
of a pre-defined point (goals in Figure 5.5). This indicates that this high-dimensional
non-linear control problem can be learned using an appropriate hierarchical decomposition.
Using this prior knowledge, we decompose the problem into two levels using the task graph
shown in Figure 5.6. At the high level, the agent learns to select among four diagonal and
four horizontal/vertical subtasks. At the low level, each subtask learns a sequence of
turning rates to achieve its own goal. We use symmetry to map the eight subtasks located
below the root to only two subtasks at the low level, one associated with the four diagonal
subtasks and one associated with the four horizontal/vertical subtasks, as shown in
Figure 5.6. We call them the diagonal and horizontal/vertical subtasks.


[Figure 5.5 labels: axes x and y from 0 to 150 m; Initial Position (40,75); Goal (140,75).]

Figure  5.5.  This  figure  shows  two  simplified  versions  of  the  ship  steering  task  used  as 
low-level  subtasks  in  the  hierarchical  decomposition  of  the  ship  steering  problem. 


The flat PGRL algorithm used in this section uses Equation 5.3 and a CMAC function
approximator with 9 four-dimensional tilings, each dividing the space into
20 x 20 x 36 x 5 = 72,000 tiles. The actor-critic algorithm also uses the above function
approximator for its actor, and 9 five-dimensional tilings of 5 x 5 x 36 x 5 x 30 = 135,000
tiles each for its critic. The fifth dimension of the critic's tilings is for the
continuous action.
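
The tile counts quoted above can be checked with a few lines of arithmetic. The sketch below also reports the total number of tiles over all tilings, assuming (as a labeled assumption, not a statement from the text) that the flat learners keep one parameter per tile; the helper name is hypothetical.

```python
from math import prod


def cmac_size(tiles_per_dimension, num_tilings):
    """Return (tiles in one tiling, tiles over all tilings)."""
    tiles_per_tiling = prod(tiles_per_dimension)
    return tiles_per_tiling, num_tilings * tiles_per_tiling


# Flat PGRL learner and actor: 9 four-dimensional tilings.
actor_per_tiling, actor_total = cmac_size((20, 20, 36, 5), 9)

# Critic: 9 five-dimensional tilings; the last dimension covers the action.
critic_per_tiling, critic_total = cmac_size((5, 5, 36, 5, 30), 9)

print(actor_per_tiling, actor_total)
print(critic_per_tiling, critic_total)
```

Under the one-parameter-per-tile assumption, the flat actor alone would carry several hundred thousand parameters, which is consistent with the per-step computation cost discussed for the flat algorithms.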

Figure 5.6. A task graph for the ship steering problem.

In the hierarchical hybrid algorithm, we decompose the task using the task graph in
Figure 5.6. At the high level, the learner explores in a low-dimensional subspace of the
original high-dimensional state space. The state variables are only the coordinates of the
ship, x and y, with the full range from 0 to 1000. The actions are four diagonal and four
horizontal/vertical subtasks similar to the subtasks shown in Figure 5.5. The state space
is coarsely discretized into 400 states. We use the value-based Q($\lambda$) algorithm with
$\epsilon$-greedy action selection and replacing traces to learn a sequence of diagonal and
horizontal/vertical subtasks that achieves the goal of the entire task (passing through the
gate). Each episode ends when the ship passes through the gate or moves out of bounds. Then
a new episode starts with the ship at a randomly chosen position, orientation, and turning
rate. In this algorithm, $\lambda$ is set to 0.9, the learning rate to 0.1, and $\epsilon$
starts at 0.1, remains unchanged until the performance of the low-level subtasks reaches a
certain level, and is then decreased by a factor of 1.01 every 50 episodes.
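
The high-level state abstraction and the exploration schedule might look like the sketch below. Several details are assumptions made only for illustration: the 400 states are taken to be a uniform 20 x 20 grid over x and y, the episode at which the low-level subtasks become good enough is passed in as the hypothetical argument `decay_start`, and the first reduction is taken to happen 50 episodes after that point.

```python
GRID = 20            # assumption: 400 states arranged as a 20 x 20 grid
CELL = 1000.0 / GRID # cell width in meters under that assumption


def high_level_state(x, y):
    """Map ship coordinates in [0, 1000] to a state index in 0..399."""
    col = min(int(x // CELL), GRID - 1)
    row = min(int(y // CELL), GRID - 1)
    return row * GRID + col


def epsilon_schedule(episode, decay_start, eps0=0.1, factor=1.01, period=50):
    """Exploration rate of the high-level learner.

    `decay_start` is None while the low-level subtasks are still below the
    required performance level (hypothetical bookkeeping, not from the text).
    """
    if decay_start is None or episode < decay_start:
        return eps0
    num_decays = (episode - decay_start) // period
    return eps0 / factor ** num_decays
```

Dividing by 1.01 every 50 episodes is a slow decay: roughly 3,500 episodes are needed to halve the exploration rate.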

At the low level, the learner explores local areas of the high-dimensional state space
without discretization. When the high-level learner selects one of the low-level subtasks,
the low-level subtask takes control and executes the following steps, as shown in Figure 5.5:
1) Maps the ship to a new coordinate system in which the ship is at position (40,40) for the
   diagonal subtask and (40,75) for the horizontal/vertical subtask.
2) Sets the low-level goal to position (140,140) for the diagonal subtask and (140,75) for
   the horizontal/vertical subtask.
3) Sets the low-level boundaries to 0 < x, y < 150.
4) Generates primitive actions until either the ship reaches a neighborhood of the low-level
   goal, a circle of radius 10 around the low-level goal (success), or moves out of the
   low-level bounds (failure).
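
The termination test in step 4 can be sketched as follows, using the start and goal positions from steps 1 and 2. The dictionary keys and the function name are hypothetical, and the coordinates are assumed to be already expressed in the local coordinate system of the subtask.

```python
import math

LOW_LEVEL_BOUND = 150.0  # local area is 0 < x, y < 150
GOAL_RADIUS = 10.0       # success neighborhood around the low-level goal

# Local start and goal positions of the two low-level subtasks.
STARTS = {"diagonal": (40.0, 40.0), "horizontal_vertical": (40.0, 75.0)}
GOALS = {"diagonal": (140.0, 140.0), "horizontal_vertical": (140.0, 75.0)}


def low_level_status(subtask, x, y):
    """Return 'success', 'failure', or 'running' for a local position."""
    goal_x, goal_y = GOALS[subtask]
    if math.hypot(x - goal_x, y - goal_y) <= GOAL_RADIUS:
        return "success"
    if not (0.0 < x < LOW_LEVEL_BOUND and 0.0 < y < LOW_LEVEL_BOUND):
        return "failure"
    return "running"
```

Checking the goal before the bounds is a design choice of this sketch; since both goals lie exactly 10 m inside the boundary, the two conditions can only meet on the boundary itself.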

The two low-level subtasks use all four state variables; however, the range of the
coordinate variables x and y is 0 to 150 instead of 0 to 1000. Their action variable is the
desired turning rate of the ship, which is a continuous variable with range -15 to 15
degrees/sec. The control interval is 0.6 sec (three times the sampling interval
$\Delta = 0.2$ sec). They use the PGRL algorithm on Lines 11 to 16 of Algorithm 3 to update
their parameters. In addition, they use a CMAC function approximator with 9 four-dimensional
tilings, each dividing the space into 5 x 5 x 36 x 5 = 4,500 tiles. One parameter w is
defined for each tile and the parameterized policy is a Gaussian:

\[
\mu(s, a, W) = \frac{1}{\sqrt{2\pi}}
  \exp\left( -\frac{\left( a - \sum_{i=1}^{N} w_i \phi_i(s) \right)^2}{2} \right)
\]

where N = 9 x 4,500 = 40,500 is the total number of tiles and $\phi_i$ is 1 if state s falls
in tile i and 0 otherwise. The actual action is generated after mapping the value chosen by
the Gaussian policy to the range from -15 to 15 degrees/sec using a sigmoid function.
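
A sketch of how an action could be drawn from such a policy is given below. It assumes a unit-variance Gaussian whose mean is the sum of the weights of the active tiles (one per tiling), and it assumes one particular sigmoid squashing, 30 * sigmoid(z) - 15, written in the numerically safer but equivalent form 15 * tanh(z / 2). The text does not specify the exact sigmoid mapping, and all function names are hypothetical.

```python
import math
import random


def policy_mean(weights, active_tiles):
    """Sum of w_i over tiles with phi_i(s) = 1 (one active tile per tiling)."""
    return sum(weights[i] for i in active_tiles)


def gaussian_density(a, mean):
    """Density of a unit-variance Gaussian with the given mean."""
    return math.exp(-0.5 * (a - mean) ** 2) / math.sqrt(2.0 * math.pi)


def squash(z, max_rate=15.0):
    """Assumed sigmoid mapping: 2*max_rate*sigmoid(z) - max_rate."""
    return max_rate * math.tanh(z / 2.0)


def sample_action(weights, active_tiles, rng):
    """Draw a raw value from the Gaussian policy and map it to [-15, 15]."""
    z = rng.gauss(policy_mean(weights, active_tiles), 1.0)
    return squash(z)


# Toy usage with three tiles, two of them active for the current state.
rng = random.Random(0)
toy_weights = [0.5, 1.0, -0.25]
toy_action = sample_action(toy_weights, [0, 2], rng)
```

In the setting described above there would be 40,500 weights and exactly 9 active tiles per state, so computing the mean touches only 9 parameters.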

In addition to the original reward of -1 per step, we define internal rewards of 100 and
-100 for low-level success and failure, and a reward according to the distance of the
current ship orientation $\theta$ to the angle $\hat{\theta}$ between the current position
and the low-level goal, given by

\[
G = \exp\left( -\frac{\| \theta - \hat{\theta} \|^2}{30 \times 30} \right) - 1
\]

where 30 degrees gives the width of the reward function. When a low-level subtask
terminates, the only reward that propagates to the high level is the sum of all -1 rewards
per step. In addition to the reward received from the low level, the high level receives a
reward of 100 upon successfully passing through the gate.
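
The orientation-based internal reward can be sketched as below. Two details are assumptions: the bearing to the goal is measured from the y axis (consistent with x advancing with the sine and y with the cosine of the orientation in Equation 5.4), and the angular difference is wrapped into [-180, 180) degrees. The function names are hypothetical.

```python
import math


def bearing_to_goal_deg(x, y, goal_x, goal_y):
    """Angle from the current position to the goal, measured from the y axis."""
    return math.degrees(math.atan2(goal_x - x, goal_y - y))


def angle_diff_deg(a, b):
    """Signed difference a - b wrapped into [-180, 180) degrees (assumption)."""
    return (a - b + 180.0) % 360.0 - 180.0


def orientation_reward(theta_deg, x, y, goal_x, goal_y, width=30.0):
    """G = exp(-diff^2 / width^2) - 1, which lies in (-1, 0]."""
    diff = angle_diff_deg(theta_deg, bearing_to_goal_deg(x, y, goal_x, goal_y))
    return math.exp(-(diff * diff) / (width * width)) - 1.0
```

The reward is 0 when the ship points straight at the low-level goal and approaches -1 as the heading error grows well beyond 30 degrees, so it never outweighs the 100 and -100 terminal rewards.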

We trained the system for 50,000 episodes. In each episode, the high-level learner
(the controller located at the root) selects a low-level subtask, and the selected low-level
subtask is executed until it terminates successfully (the ship reaches the low-level goal)
or fails (the ship goes out of the low-level bounds). Then control returns to the high-level
subtask (root) again. The following results are averaged over five simulation runs.

Figure 5.7 compares the performance of the hierarchical hybrid algorithm with the flat
PGRL and actor-critic algorithms in terms of the number of successful trials in 1000
episodes. As this figure shows, despite the high resolution function approximators used
in both flat algorithms, their performance is worse than that of the hierarchical hybrid
algorithm. Moreover, their computation time per step is also much higher than that of the
hierarchical hybrid algorithm, due to the large number of parameters to be learned.


Figure 5.7. This figure shows the performance of the hierarchical hybrid, flat PGRL, and
actor-critic algorithms in terms of the number of successful trials in 1000 episodes.


Figure 5.8 demonstrates the performance of the hierarchical hybrid algorithm in terms
of the average number of low-level subtask calls. This figure shows that after learning, the
learner executes about 4 low-level subtasks (diagonal or horizontal/vertical subtasks) per
episode.

Figure 5.8. This figure shows the performance of the hierarchical hybrid algorithm in
terms of the number of low-level subtask calls.

Figure 5.9 compares the performance of the hierarchical hybrid, flat PGRL, and actor-critic
algorithms in terms of the average number of steps to goal (averaged over 1000 episodes).
This figure shows that after learning, it takes about 220 primitive actions (turn actions)
for the hierarchical hybrid learner to pass through the gate.

Figures 5.10 and 5.11 show the performance of the diagonal and horizontal/vertical
subtasks, respectively, in terms of the number of successes out of 1000 executions.

Finally, Figure 5.12 demonstrates the learned policy for two sample initial configurations
of the ship, shown with big circles. The upper configuration is x = 700, y = 700,
$\theta = 100$, $\dot{\theta} = 3.65$ and the lower one is x = 750, y = 180, $\theta = 80$,
$\dot{\theta} = 7.9$. The low-level subtasks chosen by the agent at the high level are shown
by small circles in this figure.

Figure 5.9. This figure shows the performance of the hierarchical hybrid, flat PGRL, and
actor-critic algorithms in terms of the number of steps to pass through the gate.

Figure 5.10. This figure shows the performance of the diagonal subtask in terms of the
number of successful trials in 1000 episodes.

Figure 5.11. This figure shows the performance of the horizontal/vertical subtask in terms
of the number of successful trials in 1000 episodes.

Figure 5.12. This figure shows the learned policy for two initial configurations of the ship.


5.4  Summary  and  Future  Work 

In this chapter, we described HPGRL, a family of hierarchical policy gradient RL
algorithms for learning in domains with continuous state and/or action spaces. We compared
the performance of this family of algorithms with a hierarchical VFRL algorithm and a
flat RL algorithm in a simple taxi-fuel problem. The results demonstrate that the HPGRL
algorithm converges more slowly than the hierarchical VFRL algorithm. To accelerate learning
in HPGRL algorithms, we proposed a family of hierarchical hybrid algorithms in which
subtasks located at the high level(s) of the hierarchy are formulated as VFRL problems, and
subtasks located at the low level(s) of the hierarchy are formulated as PGRL problems. We
used a continuous state and action ship steering task to illustrate this family of
algorithms and to demonstrate their performance.

The  algorithms  proposed  in  this  chapter  are  based  on  the  assumption  that  the  overall 
task (the root of the hierarchy) is episodic. One direction for future work is to reformulate the
algorithms  presented  in  this  chapter  for  the  case  when  the  overall  task  is  continuing.  In 
this  case,  the  root  task  is  formulated  as  a  continuing  problem  with  the  average  reward  as 
its  performance  function.  Since  the  policy  learned  at  root  involves  policies  of  its  children, 
the  type  of  optimality  achieved  at  root  depends  on  how  we  formulate  other  subtasks  in 
the  hierarchy.  Different  notions  of  optimality  in  hierarchical  average  reward  presented  in 
Chapter  4  can  be  used  to  develop  new  HPGRL  algorithms  for  continuing  problems. 

Although the proposed algorithms give us the ability to deal with continuous state and
continuous action spaces, they are still not appropriate for efficiently controlling
real-world problems in which the speed of learning is crucial. The results of the ship
steering task indicate that in order to apply the proposed algorithms to real-world domains,
more powerful PGRL algorithms need to be developed: PGRL algorithms that need a smaller
number of samples to learn a good policy and are less computationally expensive.

CHAPTER 6


HIERARCHICAL MULTI-AGENT REINFORCEMENT LEARNING

In this chapter,1 we investigate the use of hierarchical reinforcement learning (HRL)
to speed up the acquisition of cooperative multi-agent tasks. Our approach to learning in
cooperative multi-agent domains differs from all the approaches discussed in Section 2.5 in
one key respect, namely the use of hierarchy to speed up multi-agent reinforcement learning.
The key idea underlying our approach is that coordination skills are learned much more
efficiently if the agents have a hierarchical representation of the task structure.
Algorithms for learning task-level coordination have also been developed in non-MDP
approaches; see Sugawara and Lesser (1998). We first introduce a hierarchical multi-agent RL
framework. In this framework, we assume agents are cooperative and each agent is given an
initial hierarchical decomposition of the overall task. Moreover, agents are homogeneous,
i.e., they use the same hierarchical task decomposition. However, learning is decentralized,
with each agent learning three interrelated skills: how to perform subtasks, which order to
do them in, and how to coordinate with other agents. The use of hierarchy speeds up learning
in multi-agent domains by making it possible to learn coordination skills at the level of
subtasks instead of primitive actions. We define cooperative subtasks to be those subtasks
in which coordination among agents significantly improves the performance of the overall
task. Agents cooperate with their teammates at cooperative subtasks and ignore them
while performing non-cooperative subtasks. Those levels of the hierarchy which include
cooperative subtasks are called cooperation levels. Since high-level coordination allows
for increased cooperation skills as agents do not get confused by low-level details, we
usually define cooperative subtasks at the high level(s) of the hierarchy. The proposed
hierarchical approach allows agents to learn coordination faster by sharing information at
the level of cooperative subtasks, rather than attempting to learn coordination at the level
of primitive actions. We initially assume that communication is free and propose a
hierarchical multi-agent RL algorithm called Cooperative HRL. In Section 6.4, we use a large
four-agent AGV scheduling problem as the experimental testbed and compare the performance of
the Cooperative HRL algorithm with selfish HRL, as well as single-agent HRL and standard
Q-learning algorithms. We also show that Cooperative HRL outperforms widely used
industrial heuristics, such as "first come first serve", "highest queue first", and "nearest
station first", in this problem.

1 Most of the work presented in this chapter first appeared in 1) Makar, Mahadevan and
Ghavamzadeh (2001), "Hierarchical multi-agent reinforcement learning," Proceedings of the
Fifth International Conference on Autonomous Agents, pp. 246-253, and 2) Ghavamzadeh and
Mahadevan (2004), "Learning to Communicate and Act using Hierarchical Reinforcement
Learning," Proceedings of the Third International Joint Conference on Autonomous Agents and
Multi-Agent Systems, pp. 1114-1121. A longer version of this work has also been submitted to
the Journal of Autonomous Agents and Multi-Agent Systems.

Later in this chapter, we address the issue of rational communication among autonomous
agents, which is important when communication is costly. The goal is for agents to learn
both action and communication policies that together optimize the task given the
communication cost. We extend the Cooperative HRL algorithm to include communication
decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative
HRL. In this algorithm, we add a communication level to the hierarchical decomposition
of the problem below each cooperation level. Before making a decision at a cooperative
subtask, agents decide if it is worthwhile to perform a communication action. A
communication action has a certain cost and provides each agent at a certain cooperation
level with the actions selected by the other agents at the same level. We demonstrate the
efficacy of the COM-Cooperative HRL algorithm as well as the relation between the
communication cost and the learned communication policy using a multi-agent taxi problem.

The rest of this chapter is organized as follows. In Section 6.1, we introduce the
multi-agent SMDP model, which is an extension of the SMDP model to cooperative multi-agent
domains. Section 6.2 describes the hierarchical multi-agent RL framework which is used
in the algorithms proposed in this chapter. In Sections 6.3 and 6.4, we introduce the
Cooperative HRL algorithm and present the experimental results of using this algorithm
in a four-agent AGV scheduling problem. In Section 6.5, we illustrate how to incorporate
communication decisions in the Cooperative HRL algorithm. In this section, after a
brief introduction to communication among agents in Section 6.5.1, we present the
COM-Cooperative HRL algorithm in Section 6.5.2. Section 6.6 presents experimental results of
using the COM-Cooperative HRL algorithm in a multi-agent taxi domain. Finally, Section
6.7 summarizes the chapter and discusses some directions for future work. The multi-agent
version of the robot trash collection task described in Chapter 3 will serve as our example
domain throughout this chapter. The multi-agent trash collection task and its task graph are
shown in Figure 6.1.


[Figure 6.1 legend: A1, A2 = agents; T1 = location of the first trash can; T2 = location of
the second trash can; Dump = location to deposit all trash.]


Figure  6.1.  A  multi-agent  trash  collection  task  and  its  associated  task  graph. 


6.1  Multi-Agent  SMDP  Model 

In this section, we extend the SMDP model described in Section 2.3 to multi-agent
domains in which a team of agents controls the process, and introduce the multi-agent SMDP
(MSMDP) model. We assume agents are cooperative, i.e., they maximize the same utility over
an extended period of time. The individual actions of agents interact in that the effect
of one agent's action may depend on the actions taken by the others. When a group of
agents perform temporally extended actions, these actions may not terminate at the same
time. Therefore, unlike the multi-agent extension of an MDP, the MMDP model (Boutilier,
1999), the multi-agent extension of an SMDP requires extending the notion of a decision
making event.

Definition 6.1: An MSMDP consists of seven components $(\Upsilon, S, A, P, R, I, \tau)$,
which are defined as follows:

The set $\Upsilon$ is a finite collection of $n$ agents, with each agent $j \in \Upsilon$
having a finite set $A^j$ of individual actions. An element $a = (a^1, \ldots, a^n)$ of the
joint-action space $A = \prod_{j=1}^{n} A^j$ represents the concurrent execution of actions
$a^j$ by each agent $j$, $j = 1, \ldots, n$. The components $S$, $R$, $I$, and $P$ are as in
an SMDP: the set of states of the system being controlled, the reward function mapping
$S \to \mathbb{R}$, the initial state distribution $I : S \to [0,1]$, and the state and
action dependent multi-step transition probability function
$P : S \times \mathbb{N} \times S \times A \to [0,1]$. The term $P(s', N | s, a)$ denotes
the probability that joint-action $a$ will cause the system to transition from state $s$ to
state $s'$ in $N$ time steps. Since the components of a joint-action are temporally extended
actions, they may not terminate at the same time. Therefore, the multi-step transition
probability $P$ depends on how we define decision epochs and, as a result, depends on the
termination scheme $\tau$. Three termination strategies, $\tau_{any}$, $\tau_{all}$, and
$\tau_{continue}$, for temporally extended joint-actions were introduced and analyzed in
Rohanimanesh and Mahadevan (2003). In the $\tau_{any}$ termination scheme, the next decision
epoch is when the first action within the joint-action currently being executed terminates,
and the rest of the actions that did not terminate are interrupted. When an agent completes
an action (e.g., finishes collect trash at T1 by putting trash in Dump), all other agents
interrupt their actions, the next decision epoch occurs, and a new joint-action is selected
(e.g., agent A1 chooses to collect trash at T2 and agent A2 decides to collect trash at T1).
In the $\tau_{all}$ termination scheme, the next decision epoch is the earliest time at
which all the actions within the joint-action currently being executed have terminated. When
an agent completes an action, it waits (takes the idle action) until all the other agents
finish their current actions. Then, the next decision epoch occurs and the agents choose the
next joint-action together. In both these termination strategies, all agents make a decision
at every decision epoch. The $\tau_{continue}$ termination scheme is similar to $\tau_{any}$
in the sense that the next decision epoch is when the first action within the joint-action
currently being executed terminates. However, the other agents are not interrupted and only
the terminated agents select new actions. In this termination strategy, only a subset of
agents choose actions at each decision epoch. When an agent completes an action, the next
decision epoch occurs only for that agent, and it selects its next action given the actions
being performed by the other agents. □
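
The difference between the three termination schemes in Definition 6.1 can be illustrated with a small sketch that, given how many time steps each agent's current action still needs, reports when the next decision epoch occurs, which agents choose a new action there, and which agents are interrupted. The function name, the string labels for the schemes, and the dictionary representation are hypothetical choices made for this illustration.

```python
def next_decision_epoch(remaining, scheme):
    """Return (elapsed_steps, deciding_agents, interrupted_agents).

    `remaining` maps each agent to the number of time steps until its
    currently executing temporally extended action terminates.
    """
    if scheme == "any":
        # Epoch at the first termination; unfinished actions are interrupted
        # and every agent takes part in selecting the new joint-action.
        elapsed = min(remaining.values())
        deciders = set(remaining)
        interrupted = {j for j, t in remaining.items() if t > elapsed}
    elif scheme == "all":
        # Epoch once every action has terminated; early finishers idle.
        elapsed = max(remaining.values())
        deciders = set(remaining)
        interrupted = set()
    elif scheme == "continue":
        # Epoch at the first termination, but only for the agents that
        # terminated; the other agents keep executing their actions.
        elapsed = min(remaining.values())
        deciders = {j for j, t in remaining.items() if t == elapsed}
        interrupted = set()
    else:
        raise ValueError(f"unknown termination scheme: {scheme}")
    return elapsed, deciders, interrupted


example = {"A1": 3, "A2": 7}
```

For the example above, only under the "continue" scheme does A1 decide alone after 3 steps while A2 carries on, which is the property that allows decision making without a central synchronizer.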

The three termination strategies described above are the most common, but not the only,
termination schemes in cooperative multi-agent activities. A wide range of termination
strategies can be defined based on them. Of course, not all of these strategies are
appropriate for any given multi-agent task. We categorize termination strategies as
synchronous and asynchronous. In synchronous schemes, such as $\tau_{any}$ and $\tau_{all}$,
all agents make a decision at every decision epoch, and therefore we need a centralized
mechanism to synchronize agents at decision epochs. In asynchronous strategies, such as
$\tau_{continue}$, only a subset of agents make a decision at each decision epoch. In this
case, there is no need for a centralized mechanism to synchronize agents, and decision
making can take place in a decentralized fashion. Since our goal is to design decentralized
multi-agent RL algorithms, we use the $\tau_{continue}$ termination scheme for joint-action
selection in the hierarchical multi-agent model and algorithms presented in this chapter.

6.2 A Hierarchical Multi-Agent Reinforcement Learning Framework

In our hierarchical multi-agent framework, we assume that there are $n$ agents in the
environment, cooperating with each other to accomplish a task. The designer of the system
uses her/his domain knowledge to recursively decompose the overall task into a collection
of subtasks that she/he believes are important for solving the problem. We assume that
agents are homogeneous, i.e., all agents are given the same task hierarchy.2 At each level
of the hierarchy, the designer of the system defines cooperative subtasks to be those
subtasks in which coordination among agents significantly increases the performance of the
overall task. The set of all cooperative subtasks at a certain level of the hierarchy is
called the cooperation set of that level. Each level of the hierarchy with a non-empty
cooperation set is called a cooperation level. The union of the children of the $l$th level
cooperative subtasks is represented by $U_l$. Since high-level coordination allows for
increased cooperation skills as agents do not get confused by low-level details, we usually
define cooperative subtasks at the highest level(s) of the hierarchy. Agents actively
coordinate while making decisions in cooperative subtasks and are ignorant of other agents
in non-cooperative subtasks. Thus, we configure cooperative subtasks to model joint-action
values. In the trash collection problem, we define root as a cooperative subtask. As a
result, the top level of the hierarchy is a cooperation level, root is the only member of
the cooperation set at the top level, and $U_1$ consists of all subtasks located at the
second level of the hierarchy, $U_1$ = {collect trash at T1, collect trash at T2} (see
Figure 6.1). As is clear in this problem, it is more efficient for an agent to learn
high-level coordination knowledge (what is the utility of agent A2 collecting trash from
trash can T1 if agent A1 is collecting trash from trash can T2) than to learn its response
to low-level primitive actions of other agents (what agent A2 should do if agent A1 aligns
with the wall). Therefore, we define single-agent policies for non-cooperative subtasks and
joint policies for cooperative subtasks.

2Studying  the  heterogeneous  case  where  agents  are  given  dissimilar  decompositions  of  the  overall  task 
would  be  more  challenging  and  beyond  the  scope  of  this  dissertation. 

Definition 6.2: Under a hierarchical policy $\mu$, each non-cooperative subtask $M_i$ can be
modeled by an SMDP consisting of components $(S_i, A_i, P_i^{\mu}, R_i)$. □

Definition 6.3: Under a hierarchical policy $\mu$, each cooperative subtask $M_i$ located at
the $l$th level of the hierarchy can be modeled by an MSMDP as follows:

T is the set of n agents in the team. We assume that agents have only local state
information and ignore the states of the other agents. Therefore, the state space of $M_i$ is
defined as the single-agent state space $S_i$ (not the joint-state space). This is certainly an
approximation, but it greatly simplifies the underlying multi-agent RL problem. The approximation
is based on the fact that an agent can get a rough idea of what state the other agents might be
in just by knowing the high-level actions being performed by them. The action space is joint and
is defined as $\mathcal{A}_i = A_i \times (U_l)^{n-1}$, where $U_l = \bigcup_{k=1}^{m} A_k$ is the
union of the action sets of all the $l$th level cooperative subtasks, and $m$ is the cardinality
of the $l$th level cooperation set. For the cooperative subtask root in the trash collection
problem, the set of agents is T = {A1, A2} and its joint-action space, $\mathcal{A}_{root}$, is
specified as the cross product of its action set, $A_{root}$, and $U_1$:
$\mathcal{A}_{root} = A_{root} \times U_1$. Finally, since we are interested in decentralized
control, we use the $\tau_{continue}$ termination strategy. Therefore, when an agent terminates a
subtask, the next decision epoch occurs only for that agent and it selects its next action
given the information about the other agents. □
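
As a small illustration of the joint-action space in Definition 6.3, the following sketch enumerates the cross product of an agent's own action set with one $U_l$ action per teammate, for the two-agent trash collection problem. The function name `joint_action_space` and the string labels are illustrative assumptions and are not notation from the text.

```python
from itertools import product


def joint_action_space(own_actions, cooperation_children, n_agents):
    """Enumerate A_i x (U_l)^(n-1).

    Each element is a tuple: the agent's own action first, followed by one
    action in U_l for each of the other n_agents - 1 teammates.
    """
    others = product(cooperation_children, repeat=n_agents - 1)
    return [(a, *rest) for rest in others for a in own_actions]


# Trash collection: root is the only top-level cooperative subtask, so its
# children are exactly the members of U_1.
U1 = ["collect trash at T1", "collect trash at T2"]
A_root = U1
joint = joint_action_space(A_root, U1, n_agents=2)
print(len(joint))
```

The size of this space grows as $|A_i| \cdot |U_l|^{n-1}$, which is one reason the text restricts joint actions to a small number of high-level subtasks.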

This  cooperative  multi-agent  approach  has  the  following  pros  and  cons: 

Pros 


•  Using  HRL  scales  learning  to  problems  with  large  state  spaces  by  using  the  task 
structure  to  restrict  the  space  of  policies. 


•  Cooperation  among  agents  is  faster  and  more  efficient  as  agents  learn  joint-action 
values  only  at  cooperative  subtasks  usually  located  at  the  high  level(s)  of  abstraction 
and  do  not  get  confused  by  low-level  details. 

•  Since  high-level  subtasks  can  take  a  long  time  to  complete,  communication  is  needed 
only  fairly  infrequently. 

• The complexity of the problem is reduced because each agent stores only its local state
information. This is justified by the fact that each agent can often get a rough idea of the
state of the other agents just by knowing their high-level actions.

Cons 


•  The  learned  policy  would  not  be  optimal  if  agents  need  to  coordinate  at  the  subtasks 
that  have  not  been  defined  as  cooperative.  This  issue  will  be  addressed  in  one  of  the 
AGV  experiments  in  Section  6.4,  by  extending  the  joint-action  model  to  the  lower 
levels  of  the  hierarchy.  Although  this  extension  provides  the  cooperation  required  at 
the  lower  levels,  it  increases  the  number  of  parameters  to  be  learned  and  as  a  result 
the  complexity  of  the  learning  problem. 

• If communication is costly, this method might not find an appropriate policy for the
problem. We address this issue in Section 6.5 by including communication decisions
in the model. If communication is cheap, agents learn to cooperate with each other,
and if communication is expensive, agents prefer to make decisions based only on
their local view of the overall problem.

• Having agents store only local state information causes sub-optimality in general. On
the other hand, including the state of the other agents dramatically increases the
complexity of the learning problem and is inefficient in its own way. We do not explicitly
address this problem in this dissertation.

The value function decomposition described in Section 3.5 relies on a key principle:
the reward function for the parent task is the value function of the child task (see Equations
3.4 and 3.5). Now, we show how the single-agent two-part value function decomposition
described in Section 3.5 can be modified to formulate the joint-value function for cooperative
subtasks. In our hierarchical multi-agent model, we configure cooperative subtasks to
store the joint completion function values.

Definition 6.4: The joint completion function for agent $j$,
$C^j(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j)$, is the expected discounted cumulative
reward of completing cooperative subtask $M_i$ after taking subtask $a^j$ in state $s$ while the
other agents are performing subtasks $a^k$, $\forall k \in \{1, \ldots, n\}$, $k \neq j$. The reward
is discounted back to the point in time where $a^j$ begins execution. □


In this definition, $M_i$ is a cooperative subtask at level $l$ of the hierarchy and
$(a^1, \ldots, a^n)$ is a joint-action in the action set of $M_i$. Each individual action in this
joint-action belongs to $U_l$. More precisely, the decomposition equations used for calculating
the projected value and action-value function for cooperative subtask $M_i$ of agent $j$ have the
following form:

$$V^j(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n) = Q^j(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, \mu_i^j(s))$$

$$Q^j(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j) = V^j(a^j, s) + C^j(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j) \qquad (6.1)$$

One important point to note in this equation is that if subtask $a^j$ is itself a cooperative
subtask at level $l+1$ of the hierarchy, its projected value function is defined as a joint
projected value function $V^j(a^j, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n)$, where
$a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n$ belong to $U_{l+1}$. In this case, in order to calculate
$V^j(a^j, s)$ for Equation 6.1, we marginalize
$V^j(a^j, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n)$ over $a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n$.
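
The decomposition in Equation 6.1 and the marginalization step can be sketched with plain dictionaries. This is only a minimal sketch: the table layout, the function names `joint_q` and `marginal_value`, and the sample numbers are assumptions made for illustration. In particular, the text does not say which distribution is used when marginalizing, so the uniform default below is an assumption.

```python
def joint_q(V, C, i, s, others, a):
    """Equation 6.1: Q^j(i, s, others, a) = V^j(a, s) + C^j(i, s, others, a).

    `others` is a tuple holding the actions of the other agents.
    Missing table entries are treated as 0.0.
    """
    return V.get((a, s), 0.0) + C.get((i, s, others, a), 0.0)


def marginal_value(joint_V, a, s, other_actions, weights=None):
    """Collapse a joint projected value V^j(a, s, others) to V^j(a, s).

    `weights` maps each tuple of other agents' actions to a probability.
    If omitted, a uniform distribution is used (an assumption).
    """
    if weights is None:
        weights = {o: 1.0 / len(other_actions) for o in other_actions}
    return sum(weights[o] * joint_V.get((a, s, o), 0.0) for o in other_actions)


# Hypothetical numbers for agent A1 at root while A2 collects trash at T2.
V = {("T1", "s0"): 2.0}
C = {("root", "s0", ("T2",), "T1"): 3.0}
q_value = joint_q(V, C, "root", "s0", ("T2",), "T1")

# Hypothetical joint projected values of a lower-level cooperative subtask.
joint_V = {("nav", "s0", ("x",)): 1.0, ("nav", "s0", ("y",)): 3.0}
v_marginal = marginal_value(joint_V, "nav", "s0", [("x",), ("y",)])
```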

We illustrate the above projected joint-value function decomposition using the trash
collection task. The two-part value function decomposition for agent A1 at root has the
following form:

$$Q^1(root, s, \text{collect trash at T2}, \text{collect trash at T1}) = V^1(\text{collect trash at T1}, s) + C^1(root, s, \text{collect trash at T2}, \text{collect trash at T1})$$

which represents the value of agent A1 performing collect trash at T1 in the context of the
overall task (root), when agent A2 is executing collect trash at T2. Note that this value is
decomposed into the projected value of the collect trash at T1 subtask (the V term), and the
completion value of the remainder of the root task (the C term).

Given  a  hierarchical  decomposition  for  any  problem,  we  need  to  find  the  highest  level 
subtasks  at  which  decomposition  Equation  6.1  provides  a  sufficiently  good  approximation 
of  the  true  value.  For  the  problems  used  in  the  experiments  of  this  chapter,  coordination 
only  at  the  highest  level  of  the  hierarchy  is  a  good  compromise  between  achieving  a  de¬ 
sirable  performance  and  reducing  the  number  of  joint-state-action  values  that  need  to  be 
learned.  Hence,  we  define  root  as  a  cooperative  subtask  and  thus  the  highest  level  of  the 
hierarchy  as  a  cooperation  level  in  these  experiments.  We  extend  coordination  to  lower 
levels  of  the  hierarchy  by  defining  cooperative  subtasks  at  levels  below  root  in  one  of  the 
experiments  of  Section  6.4. 

6.3 A Hierarchical Multi-Agent Reinforcement Learning Algorithm

In this section, we use the hierarchical multi-agent RL framework described in Section
6.2 and present a hierarchical multi-agent RL algorithm, called Cooperative HRL. The
pseudo code for this algorithm is shown in Algorithm 4 at the end of this chapter. In
Cooperative HRL, the V and C values can be learned through a standard TD-learning method
based on sample trajectories. One important point to note is that since non-primitive
subtasks are temporally extended, the update rules for C values used in this algorithm
are based on the SMDP model. In this algorithm, an agent starts from the root task and
chooses a subtask until it reaches a primitive action $i$. It executes primitive action $i$ in
state $s$, receives reward $r$, and observes the resulting state $s'$. The value function V of
primitive subtask3 $M_i$ is then updated using:

$$V_{t+1}(i, s) = [1 - \alpha_t(i)] V_t(i, s) + \alpha_t(i) r$$

where $\alpha_t(i)$ is the learning rate for subtask $M_i$ at time $t$. This parameter should be
gradually decreased to zero in the limit.

Whenever a subtask terminates, the C values are updated for all states visited during the
execution of that subtask. Assume an agent is executing a non-primitive subtask $M_i$ and is
in state $s$. While subtask $M_i$ does not terminate, it chooses subtask $M_a$ according to the
current exploration policy (softmax or $\epsilon$-greedy with respect to $\mu_i(s)$). If subtask
$M_a$ takes $N$ primitive steps and terminates in state $s'$, the corresponding C value is updated
using

$$C_{t+1}(i, s, a) = [1 - \alpha_t(i)] C_t(i, s, a) + \alpha_t(i) \gamma^N [C_t(i, s', a^*) + V_t(a^*, s')] \qquad (6.2)$$

where $a^* = \arg\max_{a' \in A_i} [C_t(i, s', a') + V_t(a', s')]$.

The V values in Equation 6.2 are calculated using the following equation:

$$V(i, s) = \begin{cases} \max_{a \in A_i} Q(i, s, a) & \text{if } M_i \text{ is a non-primitive subtask,} \\ \sum_{s' \in S_i} P(s' \mid s, i)\, r(s, i) & \text{if } M_i \text{ is a primitive action.} \end{cases}$$
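
To make the single-agent update concrete, the following is a minimal sketch of Equation 6.2 together with the recursive evaluation of V. It assumes tabular V and C stored in dictionaries and a `children` map from each non-primitive subtask to its child actions; these names, and the use of a stored estimate for primitive actions in place of the expected reward, are assumptions made for illustration.

```python
from collections import defaultdict


def value(V, C, children, i, s):
    """V(i, s): stored estimate for a primitive action, otherwise the
    maximum over children a of Q(i, s, a) = V(a, s) + C(i, s, a)."""
    if i not in children:  # primitive action
        return V[(i, s)]
    return max(value(V, C, children, a, s) + C[(i, s, a)] for a in children[i])


def update_completion(V, C, children, i, s, a, s_next, n_steps, alpha, gamma):
    """One application of Equation 6.2 for parent i and child a that ran
    for n_steps primitive steps and terminated in s_next."""
    # C(i, s', a*) + V(a*, s') for the greedy a* equals the max over children.
    best = max(
        C[(i, s_next, b)] + value(V, C, children, b, s_next) for b in children[i]
    )
    C[(i, s, a)] = (1 - alpha) * C[(i, s, a)] + alpha * gamma ** n_steps * best


# Hypothetical two-action example.
children = {"root": ["left", "right"]}
V = defaultdict(float, {("left", "s1"): 1.0, ("right", "s1"): 2.0})
C = defaultdict(float)
update_completion(V, C, children, "root", "s0", "left", "s1",
                  n_steps=2, alpha=0.5, gamma=0.9)
```

In this example the greedy child in `s1` is `right` with value 2.0, so the new completion value is 0.5 · 0.9² · 2.0 = 0.81.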

Similarly, when agent $j$ completes execution of subtask $a^j \in A_i$, the joint completion
function $C$ of cooperative subtask $M_i$ located at level $l$ of the hierarchy is updated for all
the states visited during the execution of subtask $a^j$ using

$$C^j_{t+1}(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j) = [1 - \alpha^j_t(i)]\, C^j_t(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j) + \alpha^j_t(i)\, \gamma^N [C^j_t(i, s', \hat{a}^1, \ldots, \hat{a}^{j-1}, \hat{a}^{j+1}, \ldots, \hat{a}^n, a^*) + V^j_t(a^*, s')] \qquad (6.3)$$

where $a^* = \arg\max_{a' \in A_i} [C^j_t(i, s', \hat{a}^1, \ldots, \hat{a}^{j-1}, \hat{a}^{j+1}, \ldots, \hat{a}^n, a') + V^j_t(a', s')]$,
and $a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n$ and $\hat{a}^1, \ldots, \hat{a}^{j-1}, \hat{a}^{j+1}, \ldots, \hat{a}^n$
are the actions in $U_l$ being performed by the other agents when agent $j$ is in states $s$ and
$s'$ respectively.

3We do not use V here, since projected and hierarchical value functions are the same for primitive actions.

Equation 6.3 indicates that in addition to the states visited during the execution of a
subtask in $U_l$ ($s$ and $s'$), an agent must store the actions in $U_l$ being performed by all the
other agents ($a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n$ in state $s$ and
$\hat{a}^1, \ldots, \hat{a}^{j-1}, \hat{a}^{j+1}, \ldots, \hat{a}^n$ in state $s'$). Sequence Seq
is used for this purpose in Algorithm 4.
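
The role of the stored sequence can be sketched as follows. The function replays Equation 6.3 over every (state, other agents' actions) pair recorded while subtask $a^j$ was executing, with the most recent pair first so that the k-th entry is discounted by $\gamma^k$. The name `update_joint_completion`, the dictionary layout, and the flat lookup of V (which ignores the case where a child is itself cooperative) are assumptions made for illustration.

```python
from collections import defaultdict


def update_joint_completion(C, V, i, child_seq, a_j, s_next, others_next,
                            actions, alpha, gamma):
    """Apply Equation 6.3 to each (state, others) pair in child_seq.

    child_seq is ordered most recent first, so the first entry is one
    primitive step before termination and is discounted by gamma ** 1.
    """
    best = max(C[(i, s_next, others_next, b)] + V[(b, s_next)] for b in actions)
    for n, (s, others) in enumerate(child_seq, start=1):
        key = (i, s, others, a_j)
        C[key] = (1 - alpha) * C[key] + alpha * gamma ** n * best


# Hypothetical example: agent j ran DM1 for two primitive steps while its
# teammate was performing DM2, then observed the teammate performing DM3.
C = defaultdict(float)
V = defaultdict(float, {("DM1", "s2"): 4.0})
child_seq = [("s1", ("DM2",)), ("s0", ("DM2",))]
update_joint_completion(C, V, "root", child_seq, "DM1", "s2", ("DM3",),
                        actions=["DM1", "DM2"], alpha=0.5, gamma=0.5)
```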

6.4  Experimental  Results  for  the  Cooperative  HRL  Algorithm 

In this section, we demonstrate the performance of the Cooperative HRL algorithm
proposed in Section 6.3 using a four-agent AGV scheduling task. We first provide a brief
overview of the domain, then apply the Cooperative HRL algorithm to the problem, and
finally compare its performance with other algorithms, such as selfish multi-agent HRL
(where each agent acts independently and learns its own optimal policy), single-agent HRL,
and flat Q-Learning.

Figure 6.2 shows the layout of the AGV scheduling domain. M1 to M4 show workstations
in this environment. Parts of type $i$ have to be carried to the drop-off station at
workstation $i$, $D_i$, and the assembled parts brought back from the pick-up stations of the
workstations, the $P_i$'s, to the warehouse. The AGV travel is unidirectional (as the arrows
show). This task is decomposed using the task graph in Figure 6.3. Each agent uses a copy of
this task graph. We define root as a cooperative subtask and the highest level of the
hierarchy as a cooperation level. Therefore, all subtasks at the second level of the hierarchy

Algorithm 4 The Cooperative HRL algorithm.

Function Cooperative-HRL(Agent $j$, Task $M_i$ at the $l$th level of the hierarchy, State $s$)
  let Seq = {} be the sequence of (state-visited, actions in $\bigcup_{k=1}^{L} U_k$ being performed
      by the other agents) while executing $M_i$   /* $L$ is the number of levels in the hierarchy */
  if $M_i$ is a primitive action then
    execute action $i$ in state $s$, receive reward $r(s, i)$ and observe state $s'$
    $V^j_{t+1}(i, s) \leftarrow [1 - \alpha^j_t(i)] V^j_t(i, s) + \alpha^j_t(i) r(s, i)$
    push (state $s$, actions in $\{U_l \mid l \text{ is a cooperation level}\}$ being performed by the
        other agents) onto the front of Seq
  else   /* $M_i$ is a non-primitive subtask */
    while $M_i$ has not terminated do
      if $M_i$ is a cooperative subtask then
        choose action $a^j$ according to the current exploration policy
            $\mu^j_i(s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n)$
        let ChildSeq = Cooperative-HRL($j$, $a^j$, $s$), where ChildSeq is the sequence of
            (state-visited, actions in $\bigcup_{k=1}^{L} U_k$ being performed by the other agents)
            while executing action $a^j$
        observe result state $s'$ and $\hat{a}^1, \ldots, \hat{a}^{j-1}, \hat{a}^{j+1}, \ldots, \hat{a}^n$,
            the actions in $U_l$ being performed by the other agents
        let $a^* = \arg\max_{a' \in A_i} [C^j_t(i, s', \hat{a}^1, \ldots, \hat{a}^{j-1}, \hat{a}^{j+1}, \ldots, \hat{a}^n, a') + V^j_t(a', s')]$
        let $N = 0$
        for each $(s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n)$ in ChildSeq from the beginning do
          $N = N + 1$
          $C^j_{t+1}(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j) \leftarrow
              [1 - \alpha^j_t(i)]\, C^j_t(i, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j) +
              \alpha^j_t(i)\, \gamma^N [C^j_t(i, s', \hat{a}^1, \ldots, \hat{a}^{j-1}, \hat{a}^{j+1}, \ldots, \hat{a}^n, a^*) + V^j_t(a^*, s')]$
        end for
      else   /* $M_i$ is not a cooperative subtask */
        choose action $a^j$ according to the current exploration policy $\mu^j_i(s)$
        let ChildSeq = Cooperative-HRL($j$, $a^j$, $s$), where ChildSeq is the sequence of
            (state-visited, actions in $\bigcup_{k=1}^{L} U_k$ being performed by the other agents)
            while executing action $a^j$
        observe result state $s'$
        let $a^* = \arg\max_{a' \in A_i} [C^j_t(i, s', a') + V^j_t(a', s')]$
        let $N = 0$
        for each state $s$ in ChildSeq from the beginning do
          $N = N + 1$
          $C^j_{t+1}(i, s, a^j) \leftarrow [1 - \alpha^j_t(i)]\, C^j_t(i, s, a^j) + \alpha^j_t(i)\, \gamma^N [C^j_t(i, s', a^*) + V^j_t(a^*, s')]$
        end for
      end if
      append ChildSeq onto the front of Seq
      $s = s'$
    end while
  end if
  return Seq
end Cooperative-HRL

(DM1, ..., DM4, DA1, ..., DA4) belong to set $U_1$. Coordination skills among agents
are learned by using joint-action values at the highest level of the hierarchy as described in
Section 6.3.

Figure 6.2. A multi-agent AGV scheduling domain. There are four AGVs (not shown)
which carry raw materials and finished parts between machines and the warehouse.


Figure 6.3. Task graph for the AGV scheduling task. (Legend recovered from the figure:
NavUnload: Navigation to Unload Deck.)

The state of the environment consists of the number of parts in the pick-up and drop-off
stations of each machine, and whether the warehouse contains parts of each of the four
types. In addition, each agent keeps track of its own location and status as a part of its state
space. Thus, in the flat case, the state space consists of 100 locations, 8 buffers of size 3, 9
possible states of the AGV (carrying part1, ..., carrying assembly1, ..., empty), and 2 values
for each part in the warehouse, i.e., $100 \times 4^8 \times 9 \times 2^4 \approx 10^9$ states. State
abstraction helps in reducing the state space considerably. Only the relevant state variables
are used while storing the completion functions in each node of the task graph. For example,
for the navigation subtasks, only the location state variable is relevant, and this subtask can
be learned with 100 values. Hence, for each high-level subtask (DM1, ..., DM4), the
number of relevant states would be $100 \times 9 \times 4 \times 2 = 7{,}200$, and for each high-level
subtask (DA1, ..., DA4), the number of relevant states would be $100 \times 9 \times 4 = 3{,}600$.
This state abstraction gives us a compact way of representing the C and V functions, and
speeds up the algorithm.
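
The state counts quoted above can be checked with a few lines of arithmetic. The variable names are illustrative, and reading the factors 4 and 2 in the abstracted counts as one buffer level and one warehouse flag is an assumption; the text only gives the numeric factors.

```python
locations = 100
agv_status = 9             # carrying part1..part4, carrying assembly1..assembly4, empty
buffer_levels = 4          # a buffer of size 3 holds 0, 1, 2, or 3 parts
n_buffers = 8
warehouse_flags = 2 ** 4   # present or absent, for each of the four part types

# Flat case: 100 x 4^8 x 9 x 2^4
flat_states = locations * buffer_levels ** n_buffers * agv_status * warehouse_flags

# Abstracted counts from the text: 100 x 9 x 4 x 2 and 100 x 9 x 4
dm_states = locations * agv_status * buffer_levels * 2
da_states = locations * agv_status * buffer_levels

print(flat_states, dm_states, da_states)
```

The flat count is 943,718,400, which is the "approximately $10^9$" figure in the text, compared with a few thousand values per high-level subtask after abstraction.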

In the experiments of this section, we assume that there are four agents (AGVs) in the
environment. The experimental results were generated with the following model parameters.
The inter-arrival time for parts at the warehouse is uniformly distributed with a mean
of 4 sec and variance of 1 sec. The percentages of Part1, Part2, Part3, and Part4 in the part
arrival process are 20, 28, 22, and 30 respectively. The time required for assembling the
various parts is normally distributed with means 15, 24, 24, and 30 sec for Part1, Part2,
Part3, and Part4 respectively, and variance 2 sec. The execution time of primitive actions
(right, left, forward, load, and unload) is normally distributed with mean 1000 μsec and
variance 50 μsec. The execution time for the idle action is also normally distributed with
mean 1 sec and variance 0.1 sec. Table 6.1 summarizes the values of the model parameters
used in the experiments of this section. In this task, each experiment was conducted five
times and the results were averaged.

Parameter                      Distribution   Mean (sec)   Variance (sec)
Idle Action                    Normal         1            0.1
Primitive Actions              Normal         0.001        0.00005
Assembly Time for Part1        Normal         15           2
Assembly Time for Part2        Normal         24           2
Assembly Time for Part3        Normal         24           2
Assembly Time for Part4        Normal         30           2
Inter-Arrival Time for Parts   Uniform        4            1

Table 6.1. Model parameters for the multi-agent AGV scheduling task.
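
For readers who want to reproduce a simulator with these parameters, the following is a minimal sketch of how durations could be sampled from Table 6.1. It is not the simulator used in the experiments. The dictionary keys, the function names, and the clipping of negative samples at zero are assumptions, and the table's "Variance" column is taken literally as a variance.

```python
import math
import random

# (distribution, mean in sec, variance in sec) copied from Table 6.1
PARAMS = {
    "idle": ("normal", 1.0, 0.1),
    "primitive": ("normal", 0.001, 0.00005),
    "assembly_part1": ("normal", 15.0, 2.0),
    "assembly_part2": ("normal", 24.0, 2.0),
    "assembly_part3": ("normal", 24.0, 2.0),
    "assembly_part4": ("normal", 30.0, 2.0),
    "inter_arrival": ("uniform", 4.0, 1.0),
}


def uniform_bounds(mean, variance):
    """A uniform on [a, b] has variance (b - a)^2 / 12, so the half-width
    around the mean is sqrt(3 * variance)."""
    half = math.sqrt(3.0 * variance)
    return mean - half, mean + half


def sample_duration(name, rng):
    """Draw one duration in seconds for the named parameter."""
    kind, mean, variance = PARAMS[name]
    if kind == "uniform":
        return rng.uniform(*uniform_bounds(mean, variance))
    # gauss() expects a standard deviation, not a variance. Negative draws
    # are clipped to zero because a duration cannot be negative (assumption).
    return max(0.0, rng.gauss(mean, math.sqrt(variance)))


rng = random.Random(0)
lo, hi = uniform_bounds(4.0, 1.0)
arrival = sample_duration("inter_arrival", rng)
```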


Figure 6.4 shows the throughput of the system for the three algorithms, single-agent
HRL, selfish multi-agent HRL, and Cooperative HRL. As seen in Figure 6.4, agents learn
a little faster initially in the selfish multi-agent method, but after some time the algorithm
results in sub-optimal performance. This is because two or more agents may select the
same action; once the first agent completes the task, the other agents might have to wait
a long time to complete it, due to the constraints on the number of parts that can
be stored at a particular place. The system throughput achieved using the Cooperative HRL
method is higher than that of the single-agent HRL and the selfish multi-agent HRL algorithms.
This  difference  is  even  more  significant  in  Figure  6.5,  when  the  primitive  actions  have 
longer  execution  time,  almost  of  the  average  assembly  time  (the  mean  execution  time 
of  primitive  actions  is  2  sec). 

Figure 6.6 shows the results from an implementation of single-agent flat Q-Learning
with the buffer capacity at each station set at 1. As can be seen from the plot, the flat
algorithm converges extremely slowly. The throughput at 70,000 sec has gone up to only 0.07,
compared with 2.6 for the hierarchical single-agent case. Figure 6.7 compares the Cooperative
HRL algorithm with several well-known AGV scheduling rules, highest queue first,
nearest station first, and first come first serve, showing clearly the improved performance
of the HRL method.

So far in our experiments in the AGV domain, we only defined root as a cooperative
subtask. Now in our last experiment in this domain, in addition to root, we define
navigation subtasks at the third level of the hierarchy as cooperative subtasks. Therefore, the
third level of the hierarchy is also a cooperation level and its cooperation set contains all
navigation subtasks at that level (see Figure 6.3). We configure the root and the third level
navigation subtasks to represent joint-actions. Figure 6.8 compares the performance of
the system in these two cases. When the navigation subtasks are configured to represent
joint-actions, learning is considerably slower (since the number of parameters is increased
significantly) and the overall performance is not better. The lack of improvement is due in
part to the fact that the AGV travel is unidirectional, as shown in Figure 6.2, so coordination
at the navigation level does not improve the performance of the system. However,
there exist problems in which adding joint-actions at multiple levels is worthwhile, even if
convergence is slower, because of better overall performance.

Figure 6.4. This figure shows that the Cooperative HRL algorithm outperforms both the
selfish multi-agent HRL and the single-agent HRL algorithms when the AGV travel time
and load/unload time are very much less compared to the average assembly time.

Figure 6.5. This figure compares the Cooperative HRL algorithm with the selfish multi-agent
HRL, when the AGV travel time and load/unload time are  of the average assembly time.

Figure 6.6. A flat Q-Learner learns the AGV domain extremely slowly, showing the need
for using a hierarchical task structure.

Figure 6.7. This plot shows that the Cooperative HRL algorithm outperforms three well-known,
widely used industrial heuristics for AGV scheduling.


Figure  6.8.  This  plot  compares  the  performance  of  the  Cooperative  HRL  algorithm  with 
cooperation  at  the  top  level  of  the  hierarchy  vs.  cooperation  at  the  top  and  third  levels  of 
the  hierarchy. 

6.5 Hierarchical Multi-Agent RL with Communication Decisions

Communication is used by agents to obtain local information of their teammates by
paying a certain cost. The Cooperative HRL algorithm described in Section 6.3 works
under three important assumptions: free, reliable, and instantaneous communication, i.e.,
the communication cost is zero, no message is lost in the environment, and each agent has
enough time to receive information about its teammates before taking its next action. Since
communication is free, as soon as an agent selects an action at a cooperative subtask,
it broadcasts it to the team. Using this simple rule, and the fact that communication is
reliable and instantaneous, whenever an agent is about to choose an action at an $l$th level
cooperative subtask, it knows the subtasks in $U_l$ being performed by all its teammates.

However, communication can be costly and unreliable in real-world problems. When
communication is not free, it is no longer optimal for a team that agents always broadcast
actions taken at their cooperative subtasks to their teammates. Therefore, agents must
learn to use communication optimally by taking into account its long-term return and its
immediate cost. In the remainder of this chapter, we examine the case in which communication
is not free, but still assume that it is reliable and instantaneous. In this section, we first
describe the communication framework and then illustrate how we extend the Cooperative
HRL algorithm to include communication decisions and propose a new algorithm, called
COM-Cooperative HRL. The goal of this algorithm is to learn a hierarchical policy (a set
of policies for all subtasks including the communication subtasks) to maximize the team
utility given the communication cost. Finally, in Section 6.6, we demonstrate the efficacy of
the COM-Cooperative HRL algorithm as well as the relation between the communication
cost and the learned communication policy using a multi-agent taxi domain.

6.5.1  Communication  Among  Agents 

Communication usually consists of three steps: send, answer, and receive. At the send
step $t_s$, agent $j$ decides if communication is necessary, performs a communication
action, and sends a message to agent $i$. At the answer step $t_a > t_s$, agent $i$ receives the
message from agent $j$, updates its local information using the contents of the message (if
necessary), and sends back the answer (if required). At the receive step $t_r > t_a$, agent
$j$ receives the answer to its message, updates its local information, and decides which
non-communicative action to execute. Generally there are two types of messages in a
communication framework: request and inform. For simplicity, we suppose that the relative
ordering of messages does not change, which means that for two communication actions $c_1$
and $c_2$, if $t_s(c_1) < t_s(c_2)$ then $t_a(c_1) < t_a(c_2)$ and $t_r(c_1) < t_r(c_2)$. The following
three types of communication actions are commonly used in a communication model:

• Tell(j, i): agent j sends an inform message to agent i.

•  Ask(j ,  i):  agent  j  sends  a  request  message  to  agent  i,  which  is  answered  by  agent  i 
with  an  inform  message. 

•  Sync(j,  i ):  agent  j  sends  an  inform  message  to  agent  i,  which  is  answered  by  agent 
i  with  an  inform  message. 

In the Cooperative HRL algorithm described in Section 6.3, we assume free, reliable,
and instantaneous communication. Hence, the communication protocol of this algorithm
is as follows: whenever an agent chooses an action at a cooperative subtask, it executes a
Tell communication action and sends its selected action as an inform message to all other
agents. As a result, when an agent is going to choose an action at an $l$th level cooperative
subtask, it knows the actions being performed by all other agents in $U_l$. Tell and inform are
the only communication action and type of message used in the communication protocol of the
Cooperative HRL algorithm.

6.5.2  A  Hierarchical  Multi-Agent  RL  Algorithm  with  Communication  Decisions 

When communication is costly in the Cooperative HRL algorithm, it is no longer optimal
for the team that each agent broadcasts its action to all its teammates. In this case,
each agent must learn to use communication optimally. To address the communication
cost in the COM-Cooperative HRL algorithm, we add a communication level to the task
graph of the problem below each cooperation level, as shown in Figure 6.9 for the trash
collection task. In this algorithm, when an agent is going to make a decision at an $l$th level
cooperative subtask, it first decides whether to communicate (takes the Communicate action)
with the other agents to acquire their actions in $U_l$, or not to communicate (takes the
Not-Communicate action) and select its action without acquiring new information about its
teammates. Agents decide about communication by comparing the expected value of
communication plus the communication cost, $Q(Parent(Com), s, Com) + ComCost$, with the
expected value of not communicating with the other agents, $Q(Parent(NotCom), s, NotCom)$.
If agent $j$ decides not to communicate, it chooses an action like a selfish agent
by using its action-value (not joint-action-value) function $Q^j(NotCom, s, a)$, where
$a \in Children(NotCom)$. When it decides to communicate, it first takes communication action
$Ask(j, i)$, $\forall i \in \{1, \ldots, j-1, j+1, \ldots, n\}$, where $n$ is the number of agents, and
sends a request message to all other agents. Other agents reply by taking communication action
$Tell(i, j)$ and send their action in $U_l$ as an inform message to agent $j$. Then agent $j$ uses its
joint-action-value (not action-value) function
$Q^j(Com, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a)$, $a \in Children(Com)$, to select its
next action in $U_l$. For instance, in the trash collection task, when agent A1 dumps trash and
is going to move to one of the two trash cans, it should first decide whether to communicate
with agent A2 in order to acquire its action in
$U_1$ = {collect trash at T1, collect trash at T2} or not. To make a communication decision,
agent A1 compares $Q^1(Root, s, NotCom)$ with $Q^1(Root, s, Com) + ComCost$. If it
chooses not to communicate, it selects its action using $Q^1(NotCom, s, a)$, where $a \in U_1$.
If it decides to communicate, after acquiring the action of agent A2 in $U_1$, $a^{A2}$, it selects its
own action using $Q^1(Com, s, a^{A2}, a)$, where $a$ and $a^{A2}$ both belong to $U_1$.
Figure 6.9. Task graph of the trash collection problem with communication actions. (Labels
in the figure: Communication Level; $U_1$ = children of the top-level cooperative subtask, Root.)

In COM-Cooperative HRL, we assume that when an agent decides to communicate, it
communicates with all other agents as described above. We can make the model more
complicated by letting an agent decide separately whether to communicate with each
individual agent. In this case, the number of communication actions would be
$C^1_{n-1} + C^2_{n-1} + \ldots + C^{n-1}_{n-1}$, where $C^q_p$ is the number of distinct
combinations of $q$ out of $p$ agents. For instance, in a three-agent case, the communication
actions for agent 1 would be: communicate with agent 2, communicate with agent 3, and
communicate with both agents 2 and 3. This increases the number of communication actions
and therefore the number of parameters to be learned. However, there are methods for
reducing the number of communication actions in real-world applications. For instance, we
can cluster agents based on their role in the team and treat each cluster as a single entity
to communicate with. This reduces $n$ from the number of agents to the number of clusters.
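
The growth in the number of communication actions can be checked numerically. The short Python sketch below evaluates the sum of binomial coefficients given above; the function name is hypothetical. By the binomial theorem the sum equals $2^{n-1} - 1$, the number of non-empty subsets of the $n - 1$ teammates, which is why clustering agents (a smaller $n$) helps so much.

```python
from math import comb


def num_communication_actions(n: int) -> int:
    """Sum of C(n-1, q) for q = 1..n-1: non-empty subsets of the n - 1 teammates."""
    return sum(comb(n - 1, q) for q in range(1, n))


for n in (2, 3, 5, 10):
    # The second and third printed columns agree for every n.
    print(n, num_communication_actions(n), 2 ** (n - 1) - 1)
```

For the three-agent case this gives the three actions listed in the text.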

In the COM-Cooperative HRL algorithm, Communicate subtasks are configured to store
joint completion function values, and Not-Communicate subtasks are configured to store
completion function values. The joint completion function for agent $j$,
$C^j(Com, s, a^1, \ldots, a^{j-1}, a^{j+1}, \ldots, a^n, a^j)$, is defined as the expected
discounted reward of completing subtask $a^j$ by agent $j$ in the context of the parent task
$Com$, when the other agents are performing subtasks $a^i$,
$\forall i \in \{1, \ldots, j-1, j+1, \ldots, n\}$. In the trash collection domain, if agent A1
communicates with agent A2, its value function decomposition would be

$$Q^1(Com, s, \text{Collect Trash at T2}, \text{Collect Trash at T1}) = V^1(\text{Collect Trash at T1}, s)
+ C^1(Com, s, \text{Collect Trash at T2}, \text{Collect Trash at T1})$$


which represents the projected value of agent A1 performing subtask collect trash at T1
when agent A2 is executing subtask collect trash at T2. Note that this value is decomposed
into the projected value of subtask collect trash at T1 and the value of completing
subtask $Parent(Com)$ (here Root is the parent of subtask $Com$) after executing subtask
collect trash at T1. If agent A1 does not communicate with agent A2, its value function
decomposition would be

$$Q^1(NotCom, s, \text{Collect Trash at T1}) = V^1(\text{Collect Trash at T1}, s)
+ C^1(NotCom, s, \text{Collect Trash at T1})$$


which represents the projected value of agent A1 performing subtask collect trash at T1,
regardless of the action being executed by agent A2.

Again, the $V$ and $C$ values are learned through a standard TD-learning method based on
sample trajectories, similar to the one presented in Algorithm 4. Completion function values
for an action in $U_l$ are updated when the action is taken under the Not-Communicate
subtask, and joint completion function values for an action in $U_l$ are updated when it is
selected under the Communicate subtask. In the latter case, the actions selected in $U_l$ by
the other agents are known as a result of communication and are used to update the joint
completion function values.
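
The two decompositions, and the rule for which table receives an update, can be illustrated with a small tabular sketch in Python. The class name, the key layout of each table, and the numeric values are assumptions made for this example only; the TD update itself is not reproduced here.

```python
from collections import defaultdict
from typing import Tuple


class AgentValueTables:
    """Value tables for one agent (hypothetical tabular layout)."""

    def __init__(self) -> None:
        self.V = defaultdict(float)        # V[(subtask, state)]
        self.C = defaultdict(float)        # C[(state, own_subtask)], stored under NotCom
        self.joint_C = defaultdict(float)  # joint_C[(state, others_subtasks, own_subtask)], stored under Com

    def q_notcom(self, state: str, own: str) -> float:
        # Q(NotCom, s, a) = V(a, s) + C(NotCom, s, a)
        return self.V[(own, state)] + self.C[(state, own)]

    def q_com(self, state: str, others: Tuple[str, ...], own: str) -> float:
        # Q(Com, s, others, a) = V(a, s) + C(Com, s, others, a)
        return self.V[(own, state)] + self.joint_C[(state, others, own)]

    def table_to_update(self, communicated: bool):
        # The joint table is updated only when teammates' actions are known.
        return self.joint_C if communicated else self.C


# Made-up values for agent A1 in the trash collection domain.
a1 = AgentValueTables()
a1.V[("Collect Trash at T1", "s")] = 3.0
a1.C[("s", "Collect Trash at T1")] = 1.0
a1.joint_C[("s", ("Collect Trash at T2",), "Collect Trash at T1")] = 2.0

print(a1.q_notcom("s", "Collect Trash at T1"))
print(a1.q_com("s", ("Collect Trash at T2",), "Collect Trash at T1"))
```

Both action values share the same $V$ term; only the completion term depends on whether the teammates' actions are known.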
6.6 Experimental Results for the COM-Cooperative HRL Algorithm

In this section, we demonstrate the performance of the COM-Cooperative HRL algorithm
proposed in Section 6.5.2 using a multi-agent taxi problem. We also investigate the
relation between the communication policy and the communication cost in this domain.

Consider a 5-by-5 grid world inhabited by two taxis (T1 and T2), shown in Figure 6.10.
There are four stations in this domain, marked as B(lue), G(reen), R(ed), and Y(ellow).
The task is continuing: passengers appear according to a fixed passenger arrival rate⁴ at
these four stations and wish to be transported to one of the other stations, chosen randomly.
Taxis must go to the location of a passenger, pick up the passenger, go to her/his
destination station, and drop the passenger there. The goal here is to increase the throughput
of the system, which is measured in terms of the number of passengers dropped off at their
destinations per 5,000 time steps, and to reduce the average waiting time per passenger.
This problem can be decomposed into subtasks, and the resulting task graph is shown in
Figure 6.10. Taxis need to learn three skills here. First, how to do each subtask, such
as navigating to B, G, R, or Y, and when to perform the Pickup or Putdown action. Second,
the order in which to do the subtasks, for instance going to a station and picking up a
passenger before heading to the passenger’s destination. Finally, how to communicate and
coordinate with each other: for instance, if taxi T1 is on its way to pick up a passenger at
location Blue, taxi T2 should serve a passenger at one of the other stations. The state
variables in this task are the locations of the taxis (25 values each), the status of the taxis
(5 values each: the taxi is empty or transporting a passenger to one of the four stations),
and the status of stations B, G, R, and Y (4 values each: the station is empty or has a
passenger whose destination is one of the other three stations). Thus, in the multi-agent
flat case, the size of the state space would grow to $4 \times 10^6$. The size of the $Q$ table
is this number multiplied by the number of primitive actions, 10, which is $4 \times 10^7$.
In the selfish multi-agent HRL algorithm, using state abstraction and the fact that each
agent stores only its own state variables, the number of $C$ and $V$ values to be learned is
reduced to 2 × 135,895 = 271,790, which is 135,895 values for each agent. In the
Cooperative HRL algorithm, the number of values to be learned would be
2 × 729,815 = 1,459,630. Finally, in the COM-Cooperative HRL algorithm, this number
would be 2 × 934,615 = 1,869,230. In COM-Cooperative HRL, we define root as a
cooperative subtask and the highest level of the hierarchy as a cooperation level,
as shown in Figure 6.10. Thus, root is the only member of the cooperation set at that level,
and $U_1 = A_{root} = \{GetB, GetG, GetR, GetY, Wait, Put\}$. The joint-action space
for root is specified as the cross product of the root action set and $U_1$. Finally, the
$\tau_{continue}$ termination scheme is used for joint-action selection in this domain. All the
experiments in this section were repeated five times and the results were averaged.

⁴ A passenger arrival rate of 10 indicates that, on average, one passenger arrives at the
stations every 10 time steps.
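
The table sizes quoted above follow from simple counting, which the Python sketch below reproduces. The variable names are hypothetical; the per-agent counts for the three hierarchical algorithms are taken from the text as given and are not derived here.

```python
# Flat multi-agent state space: product of the ranges of all state variables.
NUM_TAXIS = 2
TAXI_LOCATIONS = 25        # cells of the 5-by-5 grid
TAXI_STATUS = 5            # empty, or carrying a passenger to one of 4 stations
NUM_STATIONS = 4
STATION_STATUS = 4         # empty, or a passenger bound for one of 3 other stations
NUM_PRIMITIVE_ACTIONS = 10

flat_states = (
    TAXI_LOCATIONS ** NUM_TAXIS
    * TAXI_STATUS ** NUM_TAXIS
    * STATION_STATUS ** NUM_STATIONS
)
flat_q_entries = flat_states * NUM_PRIMITIVE_ACTIONS

# Per-agent numbers of C and V values, copied from the text.
per_agent_values = {
    "selfish multi-agent HRL": 135_895,
    "Cooperative HRL": 729_815,
    "COM-Cooperative HRL": 934_615,
}
totals = {name: NUM_TAXIS * count for name, count in per_agent_values.items()}

print(f"flat states: {flat_states:,}; flat Q entries: {flat_q_entries:,}")
for name, total in totals.items():
    print(f"{name}: {total:,} values ({flat_q_entries / total:.1f}x fewer than flat Q)")
```

Even the largest hierarchical representation, COM-Cooperative HRL, stores far fewer values than the flat $Q$ table.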
Figure 6.10. A multi-agent taxi domain and its associated task graph.


Figures 6.11 and 6.12 show the throughput of the system and the average waiting time
per passenger for four algorithms: single-agent HRL, selfish multi-agent HRL, Cooperative
HRL, and COM-Cooperative HRL when the communication cost is zero.⁵ As seen in Figures
6.11 and 6.12, Cooperative HRL and COM-Cooperative HRL with ComCost = 0 have
better throughput and average waiting time per passenger than selfish multi-agent HRL
and single-agent HRL. COM-Cooperative HRL learns more slowly than Cooperative HRL
because it has more parameters to learn. However, it eventually converges to
the same performance as Cooperative HRL.

⁵ COM-Cooperative HRL uses the task graph in Figure 6.10. Cooperative HRL uses the same
task graph without the communication level.


Figure 6.11. This figure shows that the Cooperative HRL and the COM-Cooperative HRL
with ComCost = 0 have better throughput than the selfish multi-agent HRL and the
single-agent HRL. (Horizontal axis: number of steps, × 10^4; passenger arrival rate = 10.)


Figure 6.12. This figure shows that the average waiting time per passenger in the
Cooperative HRL and the COM-Cooperative HRL with ComCost = 0 is less than in the selfish
multi-agent HRL and the single-agent HRL. (Horizontal axis: number of steps, × 10^4;
passenger arrival rate = 10.)

Figure 6.13 compares the average waiting time per passenger for the selfish multi-agent
HRL and the COM-Cooperative HRL with ComCost = 0 for three different passenger
arrival rates (5, 10, and 20). It demonstrates that as the passenger arrival rate becomes
smaller, coordination among taxis becomes more important. When taxis do not coordinate,
it is possible that both taxis go to the same station. In this case, the first taxi picks up
the passenger and the other one returns empty. This case can be avoided by incorporating
coordination in the system. However, when the passenger arrival rate is high, there is a
chance that a new passenger arrives after the first taxi has picked up the previous passenger
and before the second taxi reaches the station. This passenger will be picked up by the
second taxi. In this case, coordination is not as crucial as when the passenger
arrival rate is low.

Figure 6.14 demonstrates the relation between the communication policy and the
communication cost. Its two panels show the throughput and the average waiting time per
passenger for the selfish multi-agent HRL and the COM-Cooperative HRL when the
communication cost equals 0, 1, 5, and 10. In both panels, as the communication cost
increases, the performance of the COM-Cooperative HRL becomes closer to that of the selfish
multi-agent HRL. This indicates that when communication is expensive, agents learn not to
communicate and to be selfish.
Figure 6.13. This figure compares the average waiting time per passenger for the selfish
multi-agent HRL and the COM-Cooperative HRL with ComCost = 0 for three different
passenger arrival rates (5, 10, and 20). It shows that coordination among taxis becomes
more crucial as the passenger arrival rate becomes smaller.
Figure 6.14. This figure shows that as communication cost increases, the throughput (top)
and the average waiting time per passenger (bottom) of the COM-Cooperative HRL become
closer to those of the selfish multi-agent HRL. It indicates that agents learn to be selfish
when communication is expensive. (Horizontal axis: number of steps, × 10^4; passenger
arrival rate = 5.)
6.7 Summary and Future Work

In  this  chapter,  we  studied  methods  for  learning  to  communicate  and  act  in  cooperative 
multi-agent  systems  using  hierarchical  reinforcement  learning.  The  key  idea  underlying 
our  approach  is  that  coordination  skills  are  learned  much  more  efficiently  if  agents  have  a 
hierarchical  representation  of  the  task  structure.  The  use  of  hierarchy  speeds  up  learning 
in multi-agent domains by making it possible to learn coordination skills at the level of
subtasks  instead  of  primitive  actions.  A  further  advantage  of  this  approach  over  flat  learning 
methods  is  that,  since  high-level  subtasks  take  a  long  time  to  complete,  communication  is 
needed  fairly  infrequently.  We  proposed  two  new  cooperative  multi-agent  HRL  algorithms, 
Cooperative  HRL  and  COM-Cooperative  HRL  using  the  above  idea.  In  both  algorithms, 
agents  are  homogeneous,  i.e.,  use  the  same  task  decomposition,  learning  is  decentralized, 
and  each  agent  learns  three  interrelated  skills:  how  to  perform  subtasks,  which  order  to  do 
them  in,  and  how  to  coordinate  with  other  agents. 

In Cooperative HRL, we assume communication is free and therefore agents do not
need to decide if communication with their teammates is necessary. We demonstrate the
efficacy of this algorithm using a four-agent AGV scheduling problem. We compare the
performance of the Cooperative HRL algorithm with other algorithms such as selfish
multi-agent HRL, single-agent HRL, and flat Q-learning in this domain. We also show that
Cooperative HRL outperforms widely used industrial heuristics, such as “first come first
serve”, “highest queue first”, and “nearest station first”.

In COM-Cooperative HRL, we address the issue of rational communicative behavior
among autonomous agents. The goal is to learn both action and communication policies
that together optimize the task given the communication cost. This algorithm extends
Cooperative HRL by including communication decisions in the model. We study
the empirical performance of the COM-Cooperative HRL algorithm as well as the relation
between the communication cost and the communication policy using a multi-agent taxi
problem.

There are a number of directions for future work, which we briefly outline here. An
immediate question that arises is to define the classes of cooperative multi-agent problems
in which the proposed algorithms converge to a good approximation of an optimal policy. The
experiments of this chapter show that the effectiveness of these algorithms is most apparent
in tasks where agents rarely interact at low levels (for example, in the trash collection task,
two robots may rarely need to exit through the same door at the same time). However,
the algorithms can be generalized and adapted to constrained environments where agents
are constantly running into one another (for example, ten robots in a small room all trying
to leave the room at the same time) by extending cooperation to lower levels of the
hierarchy. This will result in a much larger set of action values that need to be learned, and
consequently learning will be much slower, as shown in the AGV experiment depicted in
Figure 6.8. A number of extensions would be useful, from studying the scenario where
agents are heterogeneous, to recognizing the high-level subtasks being performed by other
agents using a history of observations (plan recognition and activity modeling) instead of
direct communication. In the latter case, we assume that each agent can observe its
teammates and uses its observations to extract their high-level subtasks. Good examples of this
approach are games such as soccer, football, or basketball, in which players often extract
the strategy being performed by their teammates using recent observations instead of direct
communication. Saria and Mahadevan (2004) presented a theoretical framework for online
probabilistic plan recognition in cooperative multi-agent systems. Their model extends the
abstract hidden Markov model (AHMM) (Bui et al., 2002) to cooperative multi-agent
domains. We believe that the model presented by Saria and Mahadevan can be combined with
the learning algorithms proposed in this chapter to reduce communication by learning to
recognize the high-level subtasks being performed by the other agents.

Another direction for future work is to study different termination schemes for composing
temporally extended actions. We used the $\tau_{continue}$ termination strategy in the
algorithms proposed in this chapter. However, it would be beneficial to investigate the
$\tau_{any}$ and $\tau_{all}$ termination schemes in our model. Many other manufacturing and
robotics problems can benefit from these algorithms. Combining the proposed algorithms with
function approximation and factored action models, which would make them more appropriate
for continuous state problems, is also an important area of research. In this direction, we
believe that the algorithms proposed in this chapter can be combined with the hierarchical
policy gradient algorithms proposed in Chapter 5 for use in multi-agent domains with
continuous state and/or action spaces. Finally, studying communication features that have not
been considered in our model, such as message delay and probability of loss, is another
fundamental problem that needs to be addressed.

CHAPTER 7


CONCLUSIONS  AND  FUTURE  WORK 


This dissertation demonstrates that by exploiting domain-specific properties, we can
design more efficient hierarchical reinforcement learning (HRL) algorithms and scale up
HRL to more complex large-scale problems. This chapter provides a summary of the methods
and algorithms presented in this thesis, along with future questions that remain open.

7.1  Summary 

In  this  dissertation,  we  investigated  the  use  of  hierarchy  and  abstraction  as  a  means 
of  solving  complex  sequential  decision  making  problems,  such  as  those  with  continuous 
state  and/or  continuous  action  spaces,  and  domains  with  multiple  cooperative  agents.  We 
developed  several  novel  extensions  to  HRL  and  designed  algorithms  that  are  appropriate 
for  such  problems. 

Recent years have seen numerous successes of reinforcement learning (RL) approaches
to control and decision making under uncertainty (Tesauro, 1994; Zhang and Dietterich,
1995; Singh and Bertsekas, 1996; Crites and Barto, 1998; Ng et al., 2004). However,
existing RL methods suffer from the curse of dimensionality: the exponential growth of
the number of parameters to be learned with the size of any compact encoding of system
state (Bellman, 1957). Recent attempts to combat the curse of dimensionality have turned
to principled ways of exploiting abstraction in RL, which leads naturally to hierarchical
control architectures and associated learning algorithms (Barto and Mahadevan, 2003).
Although HRL approaches scale better than flat RL methods to high-dimensional domains,
they still suffer from the curse of dimensionality. Moreover, HRL methods have so far
only been studied in a narrow context: they have been investigated for the discrete-time
discounted reward SMDP model; they have all been value function RL methods; and they
have only been studied in single-agent domains. The methods and algorithms developed
in this dissertation expand the context and scope of HRL. They use prior knowledge in a
principled way, and extend the existing HRL frameworks and algorithms to problems with
continuous state and/or action spaces, and domains with multiple cooperative agents.

In Chapter 4, we presented new discrete-time and continuous-time hierarchically optimal
average reward RL (HAR) and recursively optimal average reward RL (RAR) algorithms
applicable to continuing tasks, including manufacturing, scheduling, queuing, and
inventory control. These algorithms are based on the average reward semi-Markov decision
process (SMDP) model, which has been shown to be more appropriate for a wide
class of continuing tasks than the better studied discounted reward SMDP model. The HAR
algorithms aim to find a hierarchical policy within the space of policies defined by the
hierarchical decomposition that maximizes the global gain. The RAR algorithms formulate
subtasks in the hierarchy as continuing average reward problems, where the goal at each
subtask is to maximize its gain given the policies of its children. We investigated the
conditions under which the policy learned by the RAR algorithm at each subtask is independent
of the context in which it is executed and therefore can be reused by other hierarchies.
We demonstrated the performance of the proposed algorithms using two automated guided
vehicle (AGV) scheduling tasks.

In Chapter 5, we described HPGRL, a family of hierarchical policy gradient RL
algorithms for learning in domains with continuous state and/or continuous action spaces.
We compared the performance of this family of algorithms with a hierarchical value function
reinforcement learning (VFRL) algorithm and a flat RL algorithm in a simple taxi-fuel
problem. The results demonstrated that the HPGRL algorithm converges more slowly than the
hierarchical VFRL algorithm. To accelerate learning in HPGRL algorithms, we proposed
a family of hierarchical hybrid algorithms in which subtasks located at high level(s) of the
hierarchy are formulated as VFRL problems, and subtasks located at low level(s) of the
hierarchy are defined as policy gradient reinforcement learning (PGRL) problems. We used a
continuous state and action ship steering task to illustrate this family of algorithms and to
demonstrate their performance.

In  Chapter  6,  we  studied  methods  for  learning  to  communicate  and  act  in  cooperative 
multi-agent  systems  using  hierarchical  reinforcement  learning.  The  key  idea  underlying 
our  approach  is  that  coordination  skills  are  learned  much  more  efficiently  if  agents  have  a 
hierarchical  representation  of  the  task  structure.  The  use  of  hierarchy  speeds  up  learning 
in multi-agent domains by making it possible to learn coordination skills at the level of
subtasks  instead  of  primitive  actions.  A  further  advantage  of  this  approach  over  flat  learning 
methods  is  that,  since  high-level  subtasks  take  a  long  time  to  complete,  communication  is 
needed  fairly  infrequently.  We  proposed  two  new  cooperative  multi-agent  HRL  algorithms, 
Cooperative  HRL  and  COM-Cooperative  HRL  using  the  above  idea.  In  both  algorithms, 
agents  are  homogeneous,  i.e.,  use  the  same  task  decomposition,  learning  is  decentralized 
and  each  agent  learns  three  interrelated  skills:  how  to  perform  subtasks,  which  order  to  do 
them  in,  and  how  to  coordinate  with  other  agents. 

In Cooperative HRL, we assume communication is free and therefore agents do not
need to decide if communication with their teammates is necessary. We demonstrated
the efficacy of this algorithm using a four-agent AGV scheduling problem. We compared
the performance of the Cooperative HRL algorithm with other algorithms such as selfish
multi-agent HRL, single-agent HRL, and flat Q-learning in this domain. We also showed
that Cooperative HRL outperforms widely used industrial heuristics, such as “first come
first serve”, “highest queue first”, and “nearest station first”.

In COM-Cooperative HRL, we addressed the issue of rational communicative behavior
among autonomous agents. The goal is to learn both action and communication policies
that together optimize the task given the communication cost. This algorithm extends
Cooperative HRL by including communication decisions in the model. We studied
the empirical performance of the COM-Cooperative HRL algorithm as well as the relation
between the communication cost and the communication policy using a multi-agent taxi
problem.


7.2  Future  Work 

There are a number of directions for future work, which are briefly outlined below.

Hierarchical  Average  Reward  Reinforcement  Learning 

An immediate question that arises is proving the asymptotic convergence of the algorithms
proposed in Chapter 4 to hierarchically and recursively optimal average reward policies.
These results would provide some theoretical validity to these algorithms, in addition
to their empirical effectiveness demonstrated in Chapter 4. Studying other local optimality
criteria for subtasks in a hierarchy is an interesting problem that needs to be addressed,
since it would help to develop more effective recursively optimal average reward RL
algorithms. Many other manufacturing and robotics problems can also benefit from the
algorithms proposed in Chapter 4.

Hierarchical  Policy  Gradient  Reinforcement  Learning 

The algorithms proposed in Chapter 5 are based on the assumption that the overall task
(root of the hierarchy) is episodic. One direction for future work is to reformulate these
algorithms  for  the  case  when  the  overall  task  is  continuing.  In  this  case,  the  root  task  is 
formulated  as  a  continuing  problem  with  the  average  reward  as  its  performance  function. 
Since  the  policy  learned  at  root  involves  policies  of  its  children,  the  type  of  optimality 
achieved  at  root  depends  on  how  we  formulate  other  subtasks  in  the  hierarchy.  Different 
notions  of  optimality  in  hierarchical  average  reward  presented  in  Chapter  4  can  be  used  to 
develop  new  HPGRL  algorithms  for  continuing  problems. 

Although the algorithms proposed in Chapter 5 give us the ability to deal with continuous
state and/or continuous action spaces, they are still not appropriate for real-world
problems in which the speed of learning is crucial. The results of the ship steering task
indicate that in order to apply the proposed algorithms to real-world domains, more powerful
PGRL algorithms are needed: algorithms that need a smaller number of samples and are
less computationally expensive.

Hierarchical Multi-Agent Reinforcement Learning

An immediate question that arises is to define the classes of cooperative multi-agent
problems in which the algorithms proposed in Chapter 6 converge to a good approximation
of an optimal policy. The experiments of Chapter 6 show that the effectiveness of these
algorithms is most apparent in tasks where agents rarely interact at low levels (for example,
in the trash collection task, two robots may rarely need to exit through the same door at
the same time). However, the algorithms can be generalized and adapted to constrained
environments where agents are constantly running into one another (for example, ten robots
in a small room all trying to leave the room at the same time) by extending cooperation
to lower levels of the hierarchy. This will result in a much larger set of action values that
need to be learned, and consequently learning will be much slower, as shown in the AGV
experiment depicted in Figure 6.8. A number of extensions would be useful, from studying
the scenario where agents are heterogeneous, to recognizing the high-level subtasks
being performed by other agents using a history of observations (plan recognition and
activity modeling) instead of direct communication. In the latter case, we assume that each
agent can observe its teammates and uses its observations to extract their high-level
subtasks. Good examples of this approach are games such as soccer, football, or basketball, in
which players often extract the strategy being performed by their teammates using recent
observations instead of direct communication. Saria and Mahadevan (2004) presented a
theoretical framework for online probabilistic plan recognition in cooperative multi-agent
systems. Their model extends the abstract hidden Markov model (AHMM) (Bui et al.,
2002) to cooperative multi-agent domains. We believe that the model presented by Saria
and Mahadevan can be combined with the learning algorithms proposed in Chapter 6 to
reduce communication by learning to recognize the high-level subtasks being performed
by the other agents.

Another direction for future work in this area is to study different termination schemes
for composing temporally extended actions. We used the $\tau_{continue}$ termination strategy
in the algorithms proposed in Chapter 6. However, it would be beneficial to investigate the
$\tau_{any}$ and $\tau_{all}$ termination schemes in our model. Many other manufacturing and
robotics problems can benefit from these algorithms. Combining these algorithms with function
approximation and factored action models, which would make them more appropriate for
continuous state problems, is also an important area of research. In this direction, we believe
that the algorithms proposed in Chapter 6 can be combined with the hierarchical policy gradient
algorithms proposed in Chapter 5 for use in multi-agent domains with continuous state
and/or action spaces. Finally, studying communication features that have not been considered
in our model, such as message delay and probability of loss, is another fundamental
problem that needs to be addressed.

7.3  Closing  Remarks 

In this dissertation, we exploited domain-specific properties to design more efficient HRL
algorithms. These algorithms extend HRL to complex sequential decision making
problems, such as those with continuous state and/or action spaces and domains with
multiple cooperative agents. However, many issues remain to be studied before learning
methods can be deployed in practical settings. In this chapter, we outlined a few open
directions that are particularly related to the methods developed in this dissertation. Of course,
there are many other, more general open questions that must be addressed before effective
learning techniques can be designed for tackling large-scale complex systems. Ultimately,
we hope that such learning methods will aid human users in solving complex problems
that require learning and adaptation.
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APPENDIX 


INDEX  OF  SYMBOLS 


Here we list the symbols used in this dissertation, to ease reading and to provide a handy
reference.


Notation                    Definition

$\mathbb{R}$                set of real numbers
$\mathbb{N}$                set of natural numbers
$E$                         expected value
$\mathcal{M}$               an MDP model
$\mathcal{S}$               set of states
$\mathcal{A}$               set of actions
$\mathcal{A}_s$             set of admissible actions in state s
$\mathcal{P}$               transition probability function in an MDP and multi-step
                            transition probability function in an SMDP
$P(s' \mid s, a)$           probability that action a causes a transition from state s
                            to state s' in an MDP
$P^\mu$                     transition probability matrix of policy $\mu$ in an MDP
$\bar{P}^\mu$               limiting matrix of policy $\mu$ in an MDP
$\mathcal{R}$               reward function
$r(s, a)$                   reward of taking action a in state s
$\mathcal{I}$               initial state distribution
$\mu$                       a policy
$\mu(a \mid s)$             probability that policy $\mu$ selects action a in state s
$\mu^*$                     optimal policy
$\gamma$                    discount factor
$\alpha$                    learning rate parameter

$V^\mu$                     value function of policy $\mu$ in flat models
$V^\mu$                     hierarchical value function of hierarchical policy $\mu$
                            in hierarchical models
$\hat{V}^\mu$               projected value function of hierarchical policy $\mu$
                            in hierarchical models
$V^*$                       optimal value function
$Q^\mu$                     action-value function of policy $\mu$ in flat models
$Q^\mu$                     hierarchical action-value function of hierarchical policy $\mu$
                            in hierarchical models
$\hat{Q}^\mu$               projected action-value function of hierarchical policy $\mu$
                            in hierarchical models
$Q^*$                       optimal action-value function
$T^*$                       Bellman operator
$g^\mu$                     average reward or gain of policy $\mu$
$g^\mu$                     global gain under hierarchical policy $\mu$
$g_i^\mu$                   local gain of subtask $M_i$ under hierarchical policy $\mu$
$g^*$                       gain of optimal policy
$H^\mu$                     average-adjusted value function of policy $\mu$ in flat models
$H^\mu$                     hierarchical average-adjusted value function of hierarchical
                            policy $\mu$ in hierarchical models
$\hat{H}^\mu$               projected average-adjusted value function of hierarchical
                            policy $\mu$ in hierarchical models
$H^*$                       optimal average-adjusted value function
$L^\mu$                     average-adjusted action-value function of policy $\mu$
                            in flat models
$L^\mu$                     hierarchical average-adjusted action-value function of
                            hierarchical policy $\mu$ in hierarchical models
$\hat{L}^\mu$               projected average-adjusted action-value function of
                            hierarchical policy $\mu$ in hierarchical models
$L^*$                       optimal average-adjusted action-value function

$P(s', N \mid s, a)$        probability that action a will cause the system to transition
                            from state s to state s' in N time steps
$m(s' \mid s, a)$           probability that an SMDP occupies state s' at the next decision
                            epoch given that the agent takes action a in state s at the
                            current decision epoch
$m^\mu$                     transition probability matrix of the embedded Markov chain of
                            an SMDP for policy $\mu$
$\bar{m}^\mu$               limiting matrix of the embedded Markov chain of an SMDP for
                            policy $\mu$
$y(s, a)$                   expected number of transition steps until the next decision epoch
$\mathcal{H}$               a hierarchy
$M_i$                       subtask $M_i$ in a hierarchy
$S_i$                       set of states for subtask $M_i$ in a hierarchy
$|S_i|$                     cardinality of set of states $S_i$
$A_i$                       set of actions for subtask $M_i$ in a hierarchy
$R_i$                       reward function for subtask $M_i$ in a hierarchy
$I_i$                       initiation set for subtask $M_i$ in a hierarchy
$T_i$                       termination set for subtask $M_i$ in a hierarchy
$s_{T_i}$                   a terminal state of subtask $M_i$ in a hierarchy;
                            $s_{T_i} \in T_i$
$\mu_i$                     a policy for subtask $M_i$ in a hierarchy
$\mu$                       a hierarchical policy
$P_i^\mu$                   multi-step transition probability function of subtask $M_i$
$P_i^\mu(s', N \mid s)$     probability that action $\mu_i(s)$ causes a transition from
                            state s to state s' in N primitive steps under hierarchical
                            policy $\mu$
$F_i^\mu$                   multi-step abstract transition probability function of
                            subtask $M_i$
$F_i^\mu(s', N \mid s)$     probability of transition from state s to state s' in N abstract
                            actions taken by subtask $M_i$ under hierarchical policy $\mu$
$P^\mu$                     single-step transition probability function under hierarchical
                            policy $\mu$
$P^\mu(s' \mid s)$          probability that hierarchical policy $\mu$ will cause the system
                            to transition from state s to state s' at the level of
                            primitive actions
$m_i^\mu$                   transition probability matrix of the Markov chain at subtask
                            $M_i$ for hierarchical policy $\mu$
$\bar{m}_i^\mu$             limiting matrix of the Markov chain at subtask $M_i$ for
                            hierarchical policy $\mu$
$\Omega$                    set of possible values for Task-Stack in a hierarchy
$\mathcal{X} = \Omega \times \mathcal{S}$
                            joint state space of Task-Stack values and states in a hierarchy
$x = (\omega, s)$           joint state value x formed by Task-Stack value $\omega$ and
                            state value s in a hierarchy
$\omega \nearrow i$         popping subtask $M_i$ off Task-Stack with content $\omega$
                            in a hierarchy
$i \searrow \omega$         pushing subtask $M_i$ onto Task-Stack with content $\omega$
                            in a hierarchy
$C^\mu$                     completion function of hierarchical policy $\mu$
$CE^\mu$                    external completion function of hierarchical policy $\mu$
$\pi^\mu$                   steady state probability vector of the Markov chain defined
                            by policy $\mu$
$\pi^\mu(s)$                steady state probability of being in state s for the Markov
                            chain defined by policy $\mu$
$\theta$                    set of policy parameters
$\theta_i$                  set of policy parameters for subtask $M_i$
$\mu_i(\theta_i)$           policy for subtask $M_i$ corresponding to parameter vector
                            $\theta_i$
$\mu(\theta)$               hierarchical policy corresponding to parameter vector $\theta$
$\chi_i(\theta)$            weighted reward-to-go of subtask $M_i$ under hierarchical policy
                            parameterized by parameter set $\theta$
$J_i(s; \theta)$            reward-to-go of subtask $M_i$ in state s under hierarchical
                            policy parameterized by parameter set $\theta$
$\Upsilon$                  finite set of agents in a multi-agent SMDP